In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno (The director of marketing) has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends. Moreno wants our team to answer: How do annual members and casual riders use Cyclistic bikes differently?
The business question assigned to me is quite clear: “How do annual members and casual riders use Cyclistic bikes differently?” Cyclistic Company seeks to understand the main differences between these two user groups - casual riders (who pay for each ride) and annual members (who pay for a yearly subscription) - through data analysis. To address this question, I will examine Cyclistic’s Rides data to gain insights into the distinct usage patterns of members and casual riders. Additionally, if there is sufficient data available, I will explore the factors that motivate casual riders to purchase Cyclistic annual memberships and consider how digital media can be utilized to influence them to become members. If data is insufficient, I will prioritize addressing these questions in subsequent analyses by collecting additional relevant data.
To approach this business question effectively, I will consider both the background information highlighting the higher profitability of annual members compared to casual riders and the dataset provided, which includes valuable information such as bike types, total number of rides, ride months, ride days, ride hours, latitude and longitude of stations, and member-causal type. Based on this information, I will formulate several hypotheses to guide my analysis:
H1: Members use bikes more often than casuals.
H2: Members’ average ride duration is higher than casuals’.
H3: Member and casuals are significantly different in bike type choice.
H4: Due to the weather in Chicago, both members and casuals prefer to use bikes in summer than winter.
H5: Members’ total rides per month are higher than casuals.
H6: Members’ total ride duration per month is higher than casuals.
H7: Members are more active on weekdays rather than weekends, casual riders are highly active on weekends.
H8: Members’ total ride duration is higher than casuals on weekdays while lower than casuals on weekends.
H9: Members’ average ride duration is higher than casuals on weekdays while lower than casuals on weekends.
H10: Members have a ride peak at 8 mornings and at 17 afternoons during weekdays while casual users tend to ride more in the afternoon during weekends.
H11: Members’ total rides per hour of the day are higher than casuals.
H12:Members’ average ride duration per hour of the day is higher than casuals.
H13: Members’ starting and ending locations are clustered in the downtown area, while casual customers’ locations are more dispersed and along the beach.
The goal of this project is to get insights that could support the marketing director’s strategy. My stakeholder in this project is the marketing director and I will report to her directly.
The dataset provided for this case study project includes 39 ‘.csv’ files representing each month from April 2020 to June 2023 and is located on the company’s cloud storage (Amazon Web Services : https://divvy-tripdata.s3.amazonaws.com/index.html).
I download these files to my local folder. As a practice, I plan to use R to process and analyse these data. To input these data into R, I setup my worddirectory with code in chunk 1.
Each table uses 13 columns (“ride_id”, “rideable_type”, “started_at”, “ended_at”, “start_station_name”, “start_station_id”, “end_station_name”, “end_station_id”, “start_lat”, “start_lng”, “end_lat”, “end_lng”, “member_casual”) to record each month’s rides.
Further inspection of the files showed that there are multiple null values across tables, also columns have inconsistent naming and formatting that causes duplicates and the inability to merge tables without additional formatting.
Thus, to prepare data for cleaning, we need to change the type of two columns (start_station_id, end_station_id) of the first 8 months to match the most recent data.
Now that we have all the data in one place saved as a new virtual table, we can start to clean the data off possible inconsistencies and/or errors. We will also add new columns based on existing data calculations to get more in-depth insights about user behaviour.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(skimr) # Summarize data
library(readr) # Import data
library(janitor) # Cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(leaflet)
library(htmlwidgets)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(htmltools)
library(leaflet.extras)
library(geosphere)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
## (status 2 uses the sf package in place of rgdal)
##
## Attaching package: 'geosphere'
## The following object is masked from 'package:htmltools':
##
## span
library(leafsync)
library(mapview)
library(tinytex)
getwd()
## [1] "/Users/rachel_sun/Desktop/divvy"
divvy_202004 <- read.csv("202004-divvy-tripdata.csv")
divvy_202005 <- read.csv("202005-divvy-tripdata.csv")
divvy_202006 <- read.csv("202006-divvy-tripdata.csv")
divvy_202007 <- read.csv("202007-divvy-tripdata.csv")
divvy_202008 <- read.csv("202008-divvy-tripdata.csv")
divvy_202009 <- read.csv("202009-divvy-tripdata.csv")
divvy_202010 <- read.csv("202010-divvy-tripdata.csv")
divvy_202011 <- read.csv("202011-divvy-tripdata.csv")
divvy_202012 <- read.csv("202012-divvy-tripdata.csv")
divvy_202101 <- read.csv("202101-divvy-tripdata.csv")
divvy_202102 <- read.csv("202102-divvy-tripdata.csv")
divvy_202103 <- read.csv("202103-divvy-tripdata.csv")
divvy_202104 <- read.csv("202104-divvy-tripdata.csv")
divvy_202105 <- read.csv("202105-divvy-tripdata.csv")
divvy_202106 <- read.csv("202106-divvy-tripdata.csv")
divvy_202107 <- read.csv("202107-divvy-tripdata.csv")
divvy_202108 <- read.csv("202108-divvy-tripdata.csv")
divvy_202109 <- read.csv("202109-divvy-tripdata.csv")
After importing the 12 files, I compared the naming conventions of the columns to check compatibility for a full union.
colnames(divvy_202004)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202005)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202006)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202007)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202008)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202009)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202010)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202011)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202012)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202101)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202102)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202103)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202104)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202105)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202106)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202107)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202108)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(divvy_202109)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Given that the column names are identical, I will use the compare_df_cols() function to probe deeper for any inconsistencies in data formatting. I have requested R to return any data type mismatches between the tables.
compare_df_cols(divvy_202004, divvy_202005,divvy_202006, divvy_202007, divvy_202008, divvy_202009, divvy_202010, divvy_202011, divvy_202012, divvy_202101, divvy_202102,divvy_202103, divvy_202104, divvy_202105, divvy_202106, divvy_202107, divvy_202108, divvy_202109, return = "mismatch")
## column_name divvy_202004 divvy_202005 divvy_202006 divvy_202007
## 1 end_station_id integer integer integer integer
## 2 start_station_id integer integer integer integer
## divvy_202008 divvy_202009 divvy_202010 divvy_202011 divvy_202012 divvy_202101
## 1 integer integer integer integer character character
## 2 integer integer integer integer character character
## divvy_202102 divvy_202103 divvy_202104 divvy_202105 divvy_202106 divvy_202107
## 1 character character character character character character
## 2 character character character character character character
## divvy_202108 divvy_202109
## 1 character character
## 2 character character
There are mismatched data types regard to end_station_id and start_station_id. From April 2020 to Ocotober 2020, the end and start station id data types are integer. The left months are character. We need transfer these two columns into character type for 202004 - 202411. I will use as.character() to cope with that.
divvy_202004$start_station_id <- as.character(divvy_202004$start_station_id)
divvy_202004$end_station_id <- as.character(divvy_202004$end_station_id)
divvy_202005$start_station_id <- as.character(divvy_202005$start_station_id)
divvy_202005$end_station_id <- as.character(divvy_202005$end_station_id)
divvy_202006$start_station_id <- as.character(divvy_202006$start_station_id)
divvy_202006$end_station_id <- as.character(divvy_202006$end_station_id)
divvy_202007$start_station_id <- as.character(divvy_202007$start_station_id)
divvy_202007$end_station_id <- as.character(divvy_202007$end_station_id)
divvy_202008$start_station_id <- as.character(divvy_202008$start_station_id)
divvy_202008$end_station_id <- as.character(divvy_202008$end_station_id)
divvy_202009$start_station_id <- as.character(divvy_202009$start_station_id)
divvy_202009$end_station_id <- as.character(divvy_202009$end_station_id)
divvy_202010$start_station_id <- as.character(divvy_202010$start_station_id)
divvy_202010$end_station_id <- as.character(divvy_202010$end_station_id)
divvy_202011$start_station_id <- as.character(divvy_202011$start_station_id)
divvy_202011$end_station_id <- as.character(divvy_202011$end_station_id)
I will use the compare_df_cols() function again to check if there is any inconsistence.
compare_df_cols(divvy_202004, divvy_202005,divvy_202006, divvy_202007, divvy_202008, divvy_202009, divvy_202010, divvy_202011, divvy_202012, divvy_202101, divvy_202102,divvy_202103, return = "mismatch")
## [1] column_name divvy_202004 divvy_202005 divvy_202006 divvy_202007
## [6] divvy_202008 divvy_202009 divvy_202010 divvy_202011 divvy_202012
## [11] divvy_202101 divvy_202102 divvy_202103
## <0 rows> (or 0-length row.names)
Since there are no mismatched data types between the tables now, combining the 12 tables into a single one will be relatively straightforward using the bind_rows() function. The column names offer sufficient context for the data, hence, they will require minimal renaming for this analysis. Vertical merge data with rbind() as all of dataset have the same columns
trip_data <- rbind(divvy_202010, divvy_202011, divvy_202012, divvy_202101, divvy_202102, divvy_202103, divvy_202104, divvy_202105, divvy_202106, divvy_202107, divvy_202108, divvy_202109)
#View(trip_data)
str(trip_data)
## 'data.frame': 5136261 obs. of 13 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2020-10-31 19:39:43" "2020-10-31 23:50:08" "2020-10-31 23:00:01" "2020-10-31 22:16:43" ...
## $ ended_at : chr "2020-10-31 19:57:12" "2020-11-01 00:04:16" "2020-10-31 23:08:22" "2020-10-31 22:19:35" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
I merged twelve CSV files containing historical trip data from October 2020 to September 2021. Upon examining the data, we checked its structure, completeness, whitespace, and the number of unique values for categorical variables, along with preliminary summary statistics. This examination highlighted numerous incomplete fields. Notably, the station id, latitude, longitude variables were frequently missing.
I use rename() function to change the name of redieable_type to bike_type, member_casual to customer_type.
trip_data <- trip_data %>%
rename(bike_type = rideable_type,
customer_type = member_casual)
colnames(trip_data)
## [1] "ride_id" "bike_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "customer_type"
clean names; include only characters, numbers, and underscores in names.
trip_data <- clean_names(trip_data)
trip_data <- remove_empty(trip_data)
## value for "which" not specified, defaulting to c("rows", "cols")
trip_data <- distinct(trip_data)
check for naming anomalies, the variables should only include limited types.
count(trip_data, bike_type)
## bike_type n
## 1 classic_bike 2750831
## 2 docked_bike 677980
## 3 electric_bike 1707450
count(trip_data, customer_type)
## customer_type n
## 1 casual 2358287
## 2 member 2777974
summary(trip_data)
## ride_id bike_type started_at ended_at
## Length:5136261 Length:5136261 Length:5136261 Length:5136261
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5136261 Length:5136261 Length:5136261 Length:5136261
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.51 Min. :-88.07
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.08 Max. :-87.52 Max. :42.17 Max. :-87.44
## NA's :4821 NA's :4821
## customer_type
## Length:5136261
## Class :character
## Mode :character
##
##
##
##
dim(trip_data)
## [1] 5136261 13
head(trip_data)
## ride_id bike_type started_at ended_at
## 1 ACB6B40CF5B9044C electric_bike 2020-10-31 19:39:43 2020-10-31 19:57:12
## 2 DF450C72FD109C01 electric_bike 2020-10-31 23:50:08 2020-11-01 00:04:16
## 3 B6396B54A15AC0DF electric_bike 2020-10-31 23:00:01 2020-10-31 23:08:22
## 4 44A4AEE261B9E854 electric_bike 2020-10-31 22:16:43 2020-10-31 22:19:35
## 5 10B7DD76A6A2EB95 electric_bike 2020-10-31 19:38:19 2020-10-31 19:54:32
## 6 DA6C3759660133DA electric_bike 2020-10-29 17:38:04 2020-10-29 17:45:43
## start_station_name start_station_id end_station_name
## 1 Lakeview Ave & Fullerton Pkwy 313 Rush St & Hubbard St
## 2 Southport Ave & Waveland Ave 227 Kedzie Ave & Milwaukee Ave
## 3 Stony Island Ave & 67th St 102 University Ave & 57th St
## 4 Clark St & Grace St 165 Broadway & Sheridan Rd
## 5 Southport Ave & Wrightwood Ave 190 Stave St & Armitage Ave
## 6 Larrabee St & Division St 359 Wells St & Huron St
## end_station_id start_lat start_lng end_lat end_lng customer_type
## 1 125 41.92610 -87.63898 41.89035 -87.62607 casual
## 2 260 41.94817 -87.66391 41.92953 -87.70782 casual
## 3 423 41.77346 -87.58537 41.79145 -87.60005 casual
## 4 256 41.95085 -87.65924 41.95281 -87.65010 casual
## 5 185 41.92886 -87.66396 41.91778 -87.69143 casual
## 6 53 41.90353 -87.64335 41.89440 -87.63431 casual
tail(trip_data)
## ride_id bike_type started_at ended_at
## 5136256 0A6AA3B1A1EC5FF4 classic_bike 2021-09-14 23:00:37 2021-09-14 23:10:55
## 5136257 FA66BCAB0D73DDC2 classic_bike 2021-09-22 15:46:57 2021-09-22 16:01:15
## 5136258 1D44DEFB5D36CA04 classic_bike 2021-09-25 16:25:23 2021-09-25 16:40:29
## 5136259 6A346EA57FC23C45 classic_bike 2021-09-25 16:26:05 2021-09-25 16:40:30
## 5136260 49360AFD771100A6 classic_bike 2021-09-15 17:57:48 2021-09-15 18:24:06
## 5136261 343190A2DC023FED electric_bike 2021-09-11 18:01:06 2021-09-11 18:08:26
## start_station_name start_station_id end_station_name
## 5136256 Ellis Ave & 60th St KA1503000014 Shore Dr & 55th St
## 5136257 Ellis Ave & 83rd St 584 Stony Island Ave & 75th St
## 5136258 Ellis Ave & 60th St KA1503000014 Shore Dr & 55th St
## 5136259 Ellis Ave & 60th St KA1503000014 Shore Dr & 55th St
## 5136260 Ellis Ave & 60th St KA1503000014 Shore Dr & 55th St
## 5136261 Wells St & Huron St TA1306000012 Clinton St & Lake St
## end_station_id start_lat start_lng end_lat end_lng customer_type
## 5136256 TA1308000009 41.78510 -87.60107 41.79521 -87.58071 member
## 5136257 KA1503000019 41.74412 -87.59903 41.75867 -87.58688 casual
## 5136258 TA1308000009 41.78510 -87.60107 41.79521 -87.58071 casual
## 5136259 TA1308000009 41.78510 -87.60107 41.79521 -87.58071 casual
## 5136260 TA1308000009 41.78510 -87.60107 41.79521 -87.58071 casual
## 5136261 13021 41.89489 -87.63434 41.88577 -87.64137 member
glimpse(trip_data)
## Rows: 5,136,261
## Columns: 13
## $ ride_id <chr> "ACB6B40CF5B9044C", "DF450C72FD109C01", "B6396B54A1…
## $ bike_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <chr> "2020-10-31 19:39:43", "2020-10-31 23:50:08", "2020…
## $ ended_at <chr> "2020-10-31 19:57:12", "2020-11-01 00:04:16", "2020…
## $ start_station_name <chr> "Lakeview Ave & Fullerton Pkwy", "Southport Ave & W…
## $ start_station_id <chr> "313", "227", "102", "165", "190", "359", "313", "1…
## $ end_station_name <chr> "Rush St & Hubbard St", "Kedzie Ave & Milwaukee Ave…
## $ end_station_id <chr> "125", "260", "423", "256", "185", "53", "125", "31…
## $ start_lat <dbl> 41.92610, 41.94817, 41.77346, 41.95085, 41.92886, 4…
## $ start_lng <dbl> -87.63898, -87.66391, -87.58537, -87.65924, -87.663…
## $ end_lat <dbl> 41.89035, 41.92953, 41.79145, 41.95281, 41.91778, 4…
## $ end_lng <dbl> -87.62607, -87.70782, -87.60005, -87.65010, -87.691…
## $ customer_type <chr> "casual", "casual", "casual", "casual", "casual", "…
#attributes(trip_data)
attr(x = trip_data, which = "names")
## [1] "ride_id" "bike_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "customer_type"
str(trip_data)
## 'data.frame': 5136261 obs. of 13 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2020-10-31 19:39:43" "2020-10-31 23:50:08" "2020-10-31 23:00:01" "2020-10-31 22:16:43" ...
## $ ended_at : chr "2020-10-31 19:57:12" "2020-11-01 00:04:16" "2020-10-31 23:08:22" "2020-10-31 22:19:35" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
skim(trip_data)
| Name | trip_data |
| Number of rows | 5136261 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5136052 | 0 |
| bike_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1.00 | 19 | 19 | 0 | 4301706 | 0 |
| ended_at | 0 | 1.00 | 19 | 19 | 0 | 4291553 | 0 |
| start_station_name | 0 | 1.00 | 0 | 53 | 523467 | 785 | 0 |
| start_station_id | 55839 | 0.99 | 0 | 36 | 467942 | 1300 | 0 |
| end_station_name | 0 | 1.00 | 0 | 53 | 567268 | 782 | 0 |
| end_station_id | 62613 | 0.99 | 0 | 36 | 504888 | 1300 | 0 |
| customer_type | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.08 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▆▇▁ |
| end_lat | 4821 | 1 | 41.90 | 0.04 | 41.51 | 41.88 | 41.90 | 41.93 | 42.17 | ▁▁▇▆▁ |
| end_lng | 4821 | 1 | -87.65 | 0.03 | -88.07 | -87.66 | -87.64 | -87.63 | -87.44 | ▁▁▁▇▁ |
As we can see above, there are some missing data in the column of start_station_id, end_station_id, end_lat, and end_lng. And, in the column of start station name and end station name, there are some empty values. I wonder if it is possible to create unique tables through station names, and then use the unique tables as a reference to fill the missing data. Let’s try it.
stations <- distinct(trip_data, start_station_name, start_station_id)
summary(stations)
## start_station_name start_station_id
## Length:1372 Length:1372
## Class :character Class :character
## Mode :character Mode :character
#View(stations)
skim(stations)
| Name | stations |
| Number of rows | 1372 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| start_station_name | 0 | 1 | 0 | 53 | 5 | 785 | 0 |
| start_station_id | 4 | 1 | 0 | 36 | 1 | 1300 | 0 |
Here we notice that there are four missing and one empty in start_station_id. It means that there are five station names without their own id. And there are also 5 empty values in the start_station_name. It seems that it is impossible to fill the missing data by using the distinct station name.
end_station <- distinct(trip_data, end_station_name, end_station_id)
skim(end_station)
| Name | end_station |
| Number of rows | 1364 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| end_station_name | 0 | 1 | 0 | 53 | 2 | 782 | 0 |
| end_station_id | 3 | 1 | 0 | 36 | 1 | 1300 | 0 |
Here we notice that there are 3 missing and 1 empty in end_station_id. It means that there 4 end station name without their own id. There are also 2 empty values in the end_station_name. It is impossible to fill the missing data by using the distinct station name.
start_id_null <- filter(trip_data, is.na(start_station_id))
summary(start_id_null)
## ride_id bike_type started_at ended_at
## Length:55839 Length:55839 Length:55839 Length:55839
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:55839 Length:55839 Length:55839 Length:55839
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.80 Min. :41.54 Min. :-87.87
## 1st Qu.:41.80 1st Qu.:-87.69 1st Qu.:41.81 1st Qu.:-87.69
## Median :41.90 Median :-87.65 Median :41.90 Median :-87.65
## Mean :41.88 Mean :-87.65 Mean :41.88 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.62 3rd Qu.:41.93 3rd Qu.:-87.62
## Max. :42.08 Max. :-87.52 Max. :42.15 Max. :-87.52
## customer_type
## Length:55839
## Class :character
## Mode :character
##
##
##
#View(start_id_null)
trip_data %>%
select(start_station_name, start_station_id) %>%
filter(is.na(start_station_id) | start_station_id == "") %>%
filter(!start_station_name == "")
## start_station_name start_station_id
## 1 W Oakdale Ave & N Broadway <NA>
## 2 W Oakdale Ave & N Broadway <NA>
## 3 W Oakdale Ave & N Broadway <NA>
## 4 W Oakdale Ave & N Broadway <NA>
## 5 W Oakdale Ave & N Broadway <NA>
## 6 W Oakdale Ave & N Broadway <NA>
## 7 W Oakdale Ave & N Broadway <NA>
## 8 W Oakdale Ave & N Broadway <NA>
## 9 W Oakdale Ave & N Broadway <NA>
## 10 W Oakdale Ave & N Broadway <NA>
## 11 W Oakdale Ave & N Broadway <NA>
## 12 W Armitage Ave & N Sheffield Ave <NA>
## 13 W Oakdale Ave & N Broadway <NA>
## 14 W Oakdale Ave & N Broadway <NA>
## 15 W Oakdale Ave & N Broadway <NA>
## 16 W Armitage Ave & N Sheffield Ave <NA>
## 17 W Armitage Ave & N Sheffield Ave <NA>
## 18 W Oakdale Ave & N Broadway <NA>
## 19 W Oakdale Ave & N Broadway <NA>
## 20 W Oakdale Ave & N Broadway <NA>
## 21 W Oakdale Ave & N Broadway <NA>
## 22 W Armitage Ave & N Sheffield Ave <NA>
## 23 W Oakdale Ave & N Broadway <NA>
## 24 W Oakdale Ave & N Broadway <NA>
## 25 W Oakdale Ave & N Broadway <NA>
## 26 W Oakdale Ave & N Broadway <NA>
## 27 W Armitage Ave & N Sheffield Ave <NA>
## 28 W Oakdale Ave & N Broadway <NA>
## 29 W Armitage Ave & N Sheffield Ave <NA>
## 30 W Oakdale Ave & N Broadway <NA>
## 31 W Oakdale Ave & N Broadway <NA>
## 32 W Oakdale Ave & N Broadway <NA>
## 33 W Oakdale Ave & N Broadway <NA>
## 34 W Armitage Ave & N Sheffield Ave <NA>
## 35 W Armitage Ave & N Sheffield Ave <NA>
## 36 W Oakdale Ave & N Broadway <NA>
## 37 W Oakdale Ave & N Broadway <NA>
## 38 W Oakdale Ave & N Broadway <NA>
## 39 W Oakdale Ave & N Broadway <NA>
## 40 W Oakdale Ave & N Broadway <NA>
## 41 W Oakdale Ave & N Broadway <NA>
## 42 W Armitage Ave & N Sheffield Ave <NA>
## 43 W Oakdale Ave & N Broadway <NA>
## 44 W Oakdale Ave & N Broadway <NA>
## 45 W Oakdale Ave & N Broadway <NA>
## 46 W Oakdale Ave & N Broadway <NA>
## 47 W Oakdale Ave & N Broadway <NA>
## 48 W Armitage Ave & N Sheffield Ave <NA>
## 49 W Oakdale Ave & N Broadway <NA>
## 50 W Oakdale Ave & N Broadway <NA>
## 51 W Oakdale Ave & N Broadway <NA>
## 52 W Oakdale Ave & N Broadway <NA>
## 53 W Oakdale Ave & N Broadway <NA>
## 54 W Oakdale Ave & N Broadway <NA>
## 55 W Oakdale Ave & N Broadway <NA>
## 56 W Armitage Ave & N Sheffield Ave <NA>
## 57 W Armitage Ave & N Sheffield Ave <NA>
## 58 W Oakdale Ave & N Broadway <NA>
## 59 W Oakdale Ave & N Broadway <NA>
## 60 W Armitage Ave & N Sheffield Ave <NA>
## 61 W Oakdale Ave & N Broadway <NA>
## 62 W Oakdale Ave & N Broadway <NA>
## 63 W Oakdale Ave & N Broadway <NA>
## 64 W Oakdale Ave & N Broadway <NA>
## 65 W Oakdale Ave & N Broadway <NA>
## 66 W Armitage Ave & N Sheffield Ave <NA>
## 67 W Oakdale Ave & N Broadway <NA>
## 68 W Oakdale Ave & N Broadway <NA>
## 69 W Oakdale Ave & N Broadway <NA>
## 70 W Oakdale Ave & N Broadway <NA>
## 71 W Armitage Ave & N Sheffield Ave <NA>
## 72 W Oakdale Ave & N Broadway <NA>
## 73 W Armitage Ave & N Sheffield Ave <NA>
## 74 W Oakdale Ave & N Broadway <NA>
## 75 W Oakdale Ave & N Broadway <NA>
## 76 W Armitage Ave & N Sheffield Ave <NA>
## 77 W Oakdale Ave & N Broadway <NA>
## 78 W Oakdale Ave & N Broadway <NA>
## 79 W Oakdale Ave & N Broadway <NA>
## 80 W Oakdale Ave & N Broadway <NA>
## 81 W Oakdale Ave & N Broadway <NA>
## 82 W Oakdale Ave & N Broadway <NA>
## 83 W Oakdale Ave & N Broadway <NA>
## 84 W Oakdale Ave & N Broadway <NA>
## 85 W Oakdale Ave & N Broadway <NA>
## 86 W Oakdale Ave & N Broadway <NA>
## 87 W Armitage Ave & N Sheffield Ave <NA>
## 88 W Armitage Ave & N Sheffield Ave <NA>
## 89 W Armitage Ave & N Sheffield Ave <NA>
## 90 W Oakdale Ave & N Broadway <NA>
## 91 W Oakdale Ave & N Broadway <NA>
## 92 W Oakdale Ave & N Broadway <NA>
## 93 W Oakdale Ave & N Broadway <NA>
## 94 W Oakdale Ave & N Broadway <NA>
## 95 W Oakdale Ave & N Broadway <NA>
## 96 W Oakdale Ave & N Broadway <NA>
## 97 W Oakdale Ave & N Broadway <NA>
## 98 W Oakdale Ave & N Broadway <NA>
## 99 W Oakdale Ave & N Broadway <NA>
## 100 W Oakdale Ave & N Broadway <NA>
## 101 W Oakdale Ave & N Broadway <NA>
## 102 W Oakdale Ave & N Broadway <NA>
## 103 W Armitage Ave & N Sheffield Ave <NA>
## 104 W Oakdale Ave & N Broadway <NA>
## 105 W Oakdale Ave & N Broadway <NA>
## 106 W Oakdale Ave & N Broadway <NA>
## 107 W Oakdale Ave & N Broadway <NA>
## 108 W Oakdale Ave & N Broadway <NA>
## 109 W Oakdale Ave & N Broadway <NA>
## 110 W Oakdale Ave & N Broadway <NA>
## 111 W Oakdale Ave & N Broadway <NA>
## 112 W Oakdale Ave & N Broadway <NA>
## 113 W Oakdale Ave & N Broadway <NA>
## 114 W Oakdale Ave & N Broadway <NA>
## 115 W Armitage Ave & N Sheffield Ave <NA>
## 116 W Armitage Ave & N Sheffield Ave <NA>
## 117 W Oakdale Ave & N Broadway <NA>
## 118 W Oakdale Ave & N Broadway <NA>
## 119 W Oakdale Ave & N Broadway <NA>
## 120 W Armitage Ave & N Sheffield Ave <NA>
## 121 W Armitage Ave & N Sheffield Ave <NA>
## 122 W Oakdale Ave & N Broadway <NA>
## 123 W Oakdale Ave & N Broadway <NA>
## 124 W Armitage Ave & N Sheffield Ave <NA>
## 125 W Oakdale Ave & N Broadway <NA>
## 126 W Oakdale Ave & N Broadway <NA>
## 127 W Oakdale Ave & N Broadway <NA>
## 128 W Oakdale Ave & N Broadway <NA>
## 129 W Armitage Ave & N Sheffield Ave <NA>
## 130 W Oakdale Ave & N Broadway <NA>
## 131 W Oakdale Ave & N Broadway <NA>
## 132 W Armitage Ave & N Sheffield Ave <NA>
## 133 W Oakdale Ave & N Broadway <NA>
## 134 W Oakdale Ave & N Broadway <NA>
## 135 W Armitage Ave & N Sheffield Ave <NA>
## 136 W Oakdale Ave & N Broadway <NA>
## 137 W Oakdale Ave & N Broadway <NA>
## 138 W Oakdale Ave & N Broadway <NA>
## 139 W Oakdale Ave & N Broadway <NA>
## 140 W Oakdale Ave & N Broadway <NA>
## 141 W Oakdale Ave & N Broadway <NA>
## 142 W Oakdale Ave & N Broadway <NA>
## 143 W Oakdale Ave & N Broadway <NA>
## 144 W Oakdale Ave & N Broadway <NA>
## 145 W Oakdale Ave & N Broadway <NA>
## 146 W Oakdale Ave & N Broadway <NA>
## 147 W Armitage Ave & N Sheffield Ave <NA>
## 148 W Oakdale Ave & N Broadway <NA>
## 149 W Armitage Ave & N Sheffield Ave <NA>
## 150 W Oakdale Ave & N Broadway <NA>
## 151 W Oakdale Ave & N Broadway <NA>
## 152 W Oakdale Ave & N Broadway <NA>
## 153 W Oakdale Ave & N Broadway <NA>
## 154 W Oakdale Ave & N Broadway <NA>
## 155 W Oakdale Ave & N Broadway <NA>
## 156 W Oakdale Ave & N Broadway <NA>
## 157 W Armitage Ave & N Sheffield Ave <NA>
## 158 W Armitage Ave & N Sheffield Ave <NA>
## 159 W Oakdale Ave & N Broadway <NA>
## 160 W Armitage Ave & N Sheffield Ave <NA>
## 161 W Oakdale Ave & N Broadway <NA>
## 162 W Armitage Ave & N Sheffield Ave <NA>
## 163 W Armitage Ave & N Sheffield Ave <NA>
## 164 W Armitage Ave & N Sheffield Ave <NA>
## 165 W Oakdale Ave & N Broadway <NA>
## 166 W Armitage Ave & N Sheffield Ave <NA>
## 167 W Oakdale Ave & N Broadway <NA>
## 168 W Armitage Ave & N Sheffield Ave <NA>
## 169 W Oakdale Ave & N Broadway <NA>
## 170 W Armitage Ave & N Sheffield Ave <NA>
## 171 W Oakdale Ave & N Broadway <NA>
## 172 W Armitage Ave & N Sheffield Ave <NA>
## 173 W Armitage Ave & N Sheffield Ave <NA>
## 174 W Oakdale Ave & N Broadway <NA>
## 175 W Oakdale Ave & N Broadway <NA>
## 176 W Oakdale Ave & N Broadway <NA>
## 177 W Armitage Ave & N Sheffield Ave <NA>
## 178 W Oakdale Ave & N Broadway <NA>
## 179 W Oakdale Ave & N Broadway <NA>
## 180 W Oakdale Ave & N Broadway <NA>
## 181 W Oakdale Ave & N Broadway <NA>
## 182 W Oakdale Ave & N Broadway <NA>
## 183 W Oakdale Ave & N Broadway <NA>
## 184 W Oakdale Ave & N Broadway <NA>
## 185 W Oakdale Ave & N Broadway <NA>
## 186 W Oakdale Ave & N Broadway <NA>
## 187 W Oakdale Ave & N Broadway <NA>
## 188 W Oakdale Ave & N Broadway <NA>
## 189 W Oakdale Ave & N Broadway <NA>
## 190 W Oakdale Ave & N Broadway <NA>
## 191 W Oakdale Ave & N Broadway <NA>
## 192 W Oakdale Ave & N Broadway <NA>
## 193 W Oakdale Ave & N Broadway <NA>
## 194 W Oakdale Ave & N Broadway <NA>
## 195 W Oakdale Ave & N Broadway <NA>
## 196 W Oakdale Ave & N Broadway <NA>
## 197 W Oakdale Ave & N Broadway <NA>
## 198 W Oakdale Ave & N Broadway <NA>
## 199 W Armitage Ave & N Sheffield Ave <NA>
## 200 W Armitage Ave & N Sheffield Ave <NA>
## 201 W Oakdale Ave & N Broadway <NA>
## 202 W Oakdale Ave & N Broadway <NA>
## 203 W Oakdale Ave & N Broadway <NA>
## 204 W Oakdale Ave & N Broadway <NA>
## 205 W Oakdale Ave & N Broadway <NA>
## 206 W Oakdale Ave & N Broadway <NA>
## 207 W Oakdale Ave & N Broadway <NA>
## 208 W Oakdale Ave & N Broadway <NA>
## 209 W Oakdale Ave & N Broadway <NA>
## 210 W Oakdale Ave & N Broadway <NA>
## 211 W Oakdale Ave & N Broadway <NA>
## 212 W Oakdale Ave & N Broadway <NA>
## 213 W Oakdale Ave & N Broadway <NA>
## 214 W Oakdale Ave & N Broadway <NA>
## 215 W Oakdale Ave & N Broadway <NA>
## 216 W Oakdale Ave & N Broadway <NA>
## 217 W Oakdale Ave & N Broadway <NA>
## 218 W Oakdale Ave & N Broadway <NA>
## 219 W Armitage Ave & N Sheffield Ave <NA>
## 220 W Oakdale Ave & N Broadway <NA>
## 221 W Oakdale Ave & N Broadway <NA>
## 222 W Oakdale Ave & N Broadway <NA>
## 223 W Oakdale Ave & N Broadway <NA>
## 224 W Armitage Ave & N Sheffield Ave <NA>
## 225 W Oakdale Ave & N Broadway <NA>
## 226 W Armitage Ave & N Sheffield Ave <NA>
## 227 W Oakdale Ave & N Broadway <NA>
## 228 W Oakdale Ave & N Broadway <NA>
## 229 W Oakdale Ave & N Broadway <NA>
## 230 W Oakdale Ave & N Broadway <NA>
## 231 W Oakdale Ave & N Broadway <NA>
## 232 W Oakdale Ave & N Broadway <NA>
## 233 W Oakdale Ave & N Broadway <NA>
## 234 W Oakdale Ave & N Broadway <NA>
## 235 W Oakdale Ave & N Broadway <NA>
## 236 W Oakdale Ave & N Broadway <NA>
## 237 W Oakdale Ave & N Broadway <NA>
## 238 W Oakdale Ave & N Broadway <NA>
## 239 S Michigan Ave & E 118th St <NA>
## 240 W Oakdale Ave & N Broadway <NA>
## 241 W Oakdale Ave & N Broadway <NA>
## 242 W Oakdale Ave & N Broadway <NA>
## 243 W Oakdale Ave & N Broadway <NA>
## 244 W Oakdale Ave & N Broadway <NA>
## 245 W Oakdale Ave & N Broadway <NA>
## 246 W Oakdale Ave & N Broadway <NA>
## 247 W Oakdale Ave & N Broadway <NA>
## 248 W Oakdale Ave & N Broadway <NA>
## 249 W Oakdale Ave & N Broadway <NA>
## 250 W Oakdale Ave & N Broadway <NA>
## 251 W Oakdale Ave & N Broadway <NA>
## 252 W Oakdale Ave & N Broadway <NA>
## 253 W Oakdale Ave & N Broadway <NA>
## 254 W Oakdale Ave & N Broadway <NA>
## 255 W Armitage Ave & N Sheffield Ave <NA>
## 256 W Oakdale Ave & N Broadway <NA>
## 257 W Oakdale Ave & N Broadway <NA>
## 258 W Oakdale Ave & N Broadway <NA>
## 259 W Oakdale Ave & N Broadway <NA>
## 260 W Oakdale Ave & N Broadway <NA>
## 261 W Oakdale Ave & N Broadway <NA>
## 262 W Oakdale Ave & N Broadway <NA>
## 263 W Oakdale Ave & N Broadway <NA>
## 264 W Oakdale Ave & N Broadway <NA>
## 265 W Oakdale Ave & N Broadway <NA>
## 266 W Oakdale Ave & N Broadway <NA>
## 267 W Oakdale Ave & N Broadway <NA>
## 268 W Oakdale Ave & N Broadway <NA>
## 269 W Armitage Ave & N Sheffield Ave <NA>
## 270 W Oakdale Ave & N Broadway <NA>
## 271 W Oakdale Ave & N Broadway <NA>
## 272 W Oakdale Ave & N Broadway <NA>
## 273 W Oakdale Ave & N Broadway <NA>
## 274 W Oakdale Ave & N Broadway <NA>
## 275 W Oakdale Ave & N Broadway <NA>
## 276 W Armitage Ave & N Sheffield Ave <NA>
## 277 W Armitage Ave & N Sheffield Ave <NA>
## 278 W Oakdale Ave & N Broadway <NA>
## 279 W Oakdale Ave & N Broadway <NA>
## 280 W Oakdale Ave & N Broadway <NA>
## 281 W Oakdale Ave & N Broadway <NA>
## 282 W Oakdale Ave & N Broadway <NA>
## 283 W Oakdale Ave & N Broadway <NA>
## 284 W Oakdale Ave & N Broadway <NA>
## 285 W Oakdale Ave & N Broadway <NA>
## 286 W Oakdale Ave & N Broadway <NA>
## 287 W Oakdale Ave & N Broadway <NA>
## 288 W Oakdale Ave & N Broadway <NA>
## 289 W Oakdale Ave & N Broadway <NA>
## 290 W Oakdale Ave & N Broadway <NA>
## 291 W Oakdale Ave & N Broadway <NA>
## 292 W Oakdale Ave & N Broadway <NA>
## 293 W Oakdale Ave & N Broadway <NA>
## 294 W Oakdale Ave & N Broadway <NA>
## 295 W Oakdale Ave & N Broadway <NA>
## 296 W Oakdale Ave & N Broadway <NA>
## 297 W Oakdale Ave & N Broadway <NA>
## 298 W Oakdale Ave & N Broadway <NA>
## 299 W Oakdale Ave & N Broadway <NA>
## 300 W Oakdale Ave & N Broadway <NA>
## 301 W Oakdale Ave & N Broadway <NA>
## 302 W Oakdale Ave & N Broadway <NA>
## 303 W Oakdale Ave & N Broadway <NA>
## 304 W Oakdale Ave & N Broadway <NA>
## 305 W Oakdale Ave & N Broadway <NA>
## 306 W Oakdale Ave & N Broadway <NA>
## 307 W Oakdale Ave & N Broadway <NA>
## 308 W Oakdale Ave & N Broadway <NA>
## 309 W Oakdale Ave & N Broadway <NA>
## 310 W Oakdale Ave & N Broadway <NA>
## 311 W Oakdale Ave & N Broadway <NA>
## 312 W Oakdale Ave & N Broadway <NA>
## 313 W Oakdale Ave & N Broadway <NA>
## 314 W Oakdale Ave & N Broadway <NA>
## 315 W Oakdale Ave & N Broadway <NA>
## 316 W Oakdale Ave & N Broadway <NA>
## 317 W Oakdale Ave & N Broadway <NA>
We can see that, when the id is NA, the station name is usually empty. There are only 317 rows where the station_id is NA or empty while the station_name is not empty.
str(start_id_null)
## 'data.frame': 55839 obs. of 13 variables:
## $ ride_id : chr "285D224410C101C5" "774087F3F7887F0C" "5260DF288DA43EF9" "1C4BF0BF80B84E90" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2020-10-28 23:12:03" "2020-10-25 22:55:10" "2020-10-13 14:38:50" "2020-10-13 07:51:13" ...
## $ ended_at : chr "2020-10-28 23:24:32" "2020-10-25 23:14:26" "2020-10-13 14:39:35" "2020-10-13 08:27:09" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr NA NA NA NA ...
## $ end_station_name : chr "Wabash Ave & Grand Ave" "Damen Ave & Pierce Ave" "" "Daley Center Plaza" ...
## $ end_station_id : chr "199" "69" NA "81" ...
## $ start_lat : num 41.9 41.9 41.9 42 41.7 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.7 ...
## $ end_lng : num -87.6 -87.7 -87.7 -87.6 -87.6 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
unique(start_id_null$bike_type)
## [1] "electric_bike"
It is only the electric bikes has NA or empty start_station_id values.
end_id_null <- trip_data %>%
filter(is.na(end_station_id))
unique(end_id_null$bike_type)
## [1] "electric_bike" "docked_bike"
However, there are electric bikes and docked bikes have NA or empty start_station_id values. It is impossible to determine what causes the NA or empty station name and id values.
trip_data %>%
select(end_station_name, end_station_id) %>%
filter(is.na(end_station_id) | end_station_id == "") %>%
filter(!end_station_name == "")
## end_station_name end_station_id
## 1 W Oakdale Ave & N Broadway <NA>
## 2 W Oakdale Ave & N Broadway <NA>
## 3 W Armitage Ave & N Sheffield Ave <NA>
## 4 W Oakdale Ave & N Broadway <NA>
## 5 W Oakdale Ave & N Broadway <NA>
## 6 W Oakdale Ave & N Broadway <NA>
## 7 W Oakdale Ave & N Broadway <NA>
## 8 W Oakdale Ave & N Broadway <NA>
## 9 W Oakdale Ave & N Broadway <NA>
## 10 W Oakdale Ave & N Broadway <NA>
## 11 W Oakdale Ave & N Broadway <NA>
## 12 W Oakdale Ave & N Broadway <NA>
## 13 W Oakdale Ave & N Broadway <NA>
## 14 W Oakdale Ave & N Broadway <NA>
## 15 W Oakdale Ave & N Broadway <NA>
## 16 W Oakdale Ave & N Broadway <NA>
## 17 W Armitage Ave & N Sheffield Ave <NA>
## 18 W Oakdale Ave & N Broadway <NA>
## 19 W Oakdale Ave & N Broadway <NA>
## 20 W Armitage Ave & N Sheffield Ave <NA>
## 21 W Oakdale Ave & N Broadway <NA>
## 22 W Oakdale Ave & N Broadway <NA>
## 23 W Oakdale Ave & N Broadway <NA>
## 24 W Oakdale Ave & N Broadway <NA>
## 25 W Oakdale Ave & N Broadway <NA>
## 26 W Armitage Ave & N Sheffield Ave <NA>
## 27 W Armitage Ave & N Sheffield Ave <NA>
## 28 W Oakdale Ave & N Broadway <NA>
## 29 W Oakdale Ave & N Broadway <NA>
## 30 W Oakdale Ave & N Broadway <NA>
## 31 W Oakdale Ave & N Broadway <NA>
## 32 W Oakdale Ave & N Broadway <NA>
## 33 W Armitage Ave & N Sheffield Ave <NA>
## 34 W Oakdale Ave & N Broadway <NA>
## 35 W Oakdale Ave & N Broadway <NA>
## 36 W Oakdale Ave & N Broadway <NA>
## 37 W Armitage Ave & N Sheffield Ave <NA>
## 38 W Oakdale Ave & N Broadway <NA>
## 39 W Oakdale Ave & N Broadway <NA>
## 40 W Oakdale Ave & N Broadway <NA>
## 41 W Armitage Ave & N Sheffield Ave <NA>
## 42 W Oakdale Ave & N Broadway <NA>
## 43 W Oakdale Ave & N Broadway <NA>
## 44 W Oakdale Ave & N Broadway <NA>
## 45 W Oakdale Ave & N Broadway <NA>
## 46 W Armitage Ave & N Sheffield Ave <NA>
## 47 W Armitage Ave & N Sheffield Ave <NA>
## 48 W Oakdale Ave & N Broadway <NA>
## 49 W Oakdale Ave & N Broadway <NA>
## 50 W Oakdale Ave & N Broadway <NA>
## 51 W Armitage Ave & N Sheffield Ave <NA>
## 52 W Oakdale Ave & N Broadway <NA>
## 53 W Oakdale Ave & N Broadway <NA>
## 54 W Oakdale Ave & N Broadway <NA>
## 55 W Oakdale Ave & N Broadway <NA>
## 56 W Oakdale Ave & N Broadway <NA>
## 57 W Armitage Ave & N Sheffield Ave <NA>
## 58 W Oakdale Ave & N Broadway <NA>
## 59 W Oakdale Ave & N Broadway <NA>
## 60 W Oakdale Ave & N Broadway <NA>
## 61 W Oakdale Ave & N Broadway <NA>
## 62 W Oakdale Ave & N Broadway <NA>
## 63 W Oakdale Ave & N Broadway <NA>
## 64 W Oakdale Ave & N Broadway <NA>
## 65 W Oakdale Ave & N Broadway <NA>
## 66 W Oakdale Ave & N Broadway <NA>
## 67 W Oakdale Ave & N Broadway <NA>
## 68 W Oakdale Ave & N Broadway <NA>
## 69 W Oakdale Ave & N Broadway <NA>
## 70 W Oakdale Ave & N Broadway <NA>
## 71 W Oakdale Ave & N Broadway <NA>
## 72 W Oakdale Ave & N Broadway <NA>
## 73 W Oakdale Ave & N Broadway <NA>
## 74 W Oakdale Ave & N Broadway <NA>
## 75 W Armitage Ave & N Sheffield Ave <NA>
## 76 W Armitage Ave & N Sheffield Ave <NA>
## 77 W Armitage Ave & N Sheffield Ave <NA>
## 78 W Oakdale Ave & N Broadway <NA>
## 79 W Oakdale Ave & N Broadway <NA>
## 80 W Oakdale Ave & N Broadway <NA>
## 81 W Armitage Ave & N Sheffield Ave <NA>
## 82 W Oakdale Ave & N Broadway <NA>
## 83 W Oakdale Ave & N Broadway <NA>
## 84 W Oakdale Ave & N Broadway <NA>
## 85 W Oakdale Ave & N Broadway <NA>
## 86 W Armitage Ave & N Sheffield Ave <NA>
## 87 W Oakdale Ave & N Broadway <NA>
## 88 W Oakdale Ave & N Broadway <NA>
## 89 W Oakdale Ave & N Broadway <NA>
## 90 W Oakdale Ave & N Broadway <NA>
## 91 W Oakdale Ave & N Broadway <NA>
## 92 W Oakdale Ave & N Broadway <NA>
## 93 W Armitage Ave & N Sheffield Ave <NA>
## 94 W Oakdale Ave & N Broadway <NA>
## 95 W Armitage Ave & N Sheffield Ave <NA>
## 96 W Oakdale Ave & N Broadway <NA>
## 97 W Oakdale Ave & N Broadway <NA>
## 98 W Oakdale Ave & N Broadway <NA>
## 99 W Oakdale Ave & N Broadway <NA>
## 100 W Armitage Ave & N Sheffield Ave <NA>
## 101 W Oakdale Ave & N Broadway <NA>
## 102 W Oakdale Ave & N Broadway <NA>
## 103 W Oakdale Ave & N Broadway <NA>
## 104 W Oakdale Ave & N Broadway <NA>
## 105 W Oakdale Ave & N Broadway <NA>
## 106 W Oakdale Ave & N Broadway <NA>
## 107 W Oakdale Ave & N Broadway <NA>
## 108 W Oakdale Ave & N Broadway <NA>
## 109 W Oakdale Ave & N Broadway <NA>
## 110 W Oakdale Ave & N Broadway <NA>
## 111 W Oakdale Ave & N Broadway <NA>
## 112 W Armitage Ave & N Sheffield Ave <NA>
## 113 W Oakdale Ave & N Broadway <NA>
## 114 W Oakdale Ave & N Broadway <NA>
## 115 W Oakdale Ave & N Broadway <NA>
## 116 W Oakdale Ave & N Broadway <NA>
## 117 W Oakdale Ave & N Broadway <NA>
## 118 W Oakdale Ave & N Broadway <NA>
## 119 W Armitage Ave & N Sheffield Ave <NA>
## 120 W Oakdale Ave & N Broadway <NA>
## 121 W Oakdale Ave & N Broadway <NA>
## 122 W Oakdale Ave & N Broadway <NA>
## 123 W Armitage Ave & N Sheffield Ave <NA>
## 124 W Armitage Ave & N Sheffield Ave <NA>
## 125 W Oakdale Ave & N Broadway <NA>
## 126 W Oakdale Ave & N Broadway <NA>
## 127 W Oakdale Ave & N Broadway <NA>
## 128 W Oakdale Ave & N Broadway <NA>
## 129 W Oakdale Ave & N Broadway <NA>
## 130 W Armitage Ave & N Sheffield Ave <NA>
## 131 W Armitage Ave & N Sheffield Ave <NA>
## 132 W Oakdale Ave & N Broadway <NA>
## 133 W Oakdale Ave & N Broadway <NA>
## 134 W Oakdale Ave & N Broadway <NA>
## 135 W Oakdale Ave & N Broadway <NA>
## 136 W Oakdale Ave & N Broadway <NA>
## 137 W Oakdale Ave & N Broadway <NA>
## 138 W Oakdale Ave & N Broadway <NA>
## 139 W Oakdale Ave & N Broadway <NA>
## 140 W Armitage Ave & N Sheffield Ave <NA>
## 141 W Oakdale Ave & N Broadway <NA>
## 142 W Oakdale Ave & N Broadway <NA>
## 143 W Oakdale Ave & N Broadway <NA>
## 144 W Oakdale Ave & N Broadway <NA>
## 145 W Oakdale Ave & N Broadway <NA>
## 146 W Oakdale Ave & N Broadway <NA>
## 147 W Oakdale Ave & N Broadway <NA>
## 148 W Oakdale Ave & N Broadway <NA>
## 149 W Oakdale Ave & N Broadway <NA>
## 150 W Oakdale Ave & N Broadway <NA>
## 151 W Oakdale Ave & N Broadway <NA>
## 152 W Oakdale Ave & N Broadway <NA>
## 153 W Oakdale Ave & N Broadway <NA>
## 154 W Oakdale Ave & N Broadway <NA>
## 155 W Oakdale Ave & N Broadway <NA>
## 156 W Armitage Ave & N Sheffield Ave <NA>
## 157 W Oakdale Ave & N Broadway <NA>
## 158 W Oakdale Ave & N Broadway <NA>
## 159 W Oakdale Ave & N Broadway <NA>
## 160 W Oakdale Ave & N Broadway <NA>
## 161 W Oakdale Ave & N Broadway <NA>
## 162 W Oakdale Ave & N Broadway <NA>
## 163 W Oakdale Ave & N Broadway <NA>
## 164 W Oakdale Ave & N Broadway <NA>
## 165 W Oakdale Ave & N Broadway <NA>
## 166 W Oakdale Ave & N Broadway <NA>
## 167 W Oakdale Ave & N Broadway <NA>
## 168 W Oakdale Ave & N Broadway <NA>
## 169 W Oakdale Ave & N Broadway <NA>
## 170 W Oakdale Ave & N Broadway <NA>
## 171 W Oakdale Ave & N Broadway <NA>
## 172 W Oakdale Ave & N Broadway <NA>
## 173 W Oakdale Ave & N Broadway <NA>
## 174 W Oakdale Ave & N Broadway <NA>
## 175 W Oakdale Ave & N Broadway <NA>
## 176 W Oakdale Ave & N Broadway <NA>
## 177 W Oakdale Ave & N Broadway <NA>
## 178 W Oakdale Ave & N Broadway <NA>
## 179 W Oakdale Ave & N Broadway <NA>
## 180 W Oakdale Ave & N Broadway <NA>
## 181 W Oakdale Ave & N Broadway <NA>
## 182 W Oakdale Ave & N Broadway <NA>
## 183 W Oakdale Ave & N Broadway <NA>
## 184 W Oakdale Ave & N Broadway <NA>
## 185 W Oakdale Ave & N Broadway <NA>
## 186 W Oakdale Ave & N Broadway <NA>
## 187 W Oakdale Ave & N Broadway <NA>
## 188 W Oakdale Ave & N Broadway <NA>
## 189 W Oakdale Ave & N Broadway <NA>
## 190 W Oakdale Ave & N Broadway <NA>
## 191 W Oakdale Ave & N Broadway <NA>
## 192 W Oakdale Ave & N Broadway <NA>
## 193 W Oakdale Ave & N Broadway <NA>
## 194 W Oakdale Ave & N Broadway <NA>
## 195 W Oakdale Ave & N Broadway <NA>
## 196 W Oakdale Ave & N Broadway <NA>
## 197 W Oakdale Ave & N Broadway <NA>
## 198 W Oakdale Ave & N Broadway <NA>
## 199 W Oakdale Ave & N Broadway <NA>
## 200 W Oakdale Ave & N Broadway <NA>
## 201 W Oakdale Ave & N Broadway <NA>
## 202 W Oakdale Ave & N Broadway <NA>
## 203 W Oakdale Ave & N Broadway <NA>
## 204 W Oakdale Ave & N Broadway <NA>
## 205 W Oakdale Ave & N Broadway <NA>
## 206 W Oakdale Ave & N Broadway <NA>
## 207 W Oakdale Ave & N Broadway <NA>
## 208 W Oakdale Ave & N Broadway <NA>
## 209 W Oakdale Ave & N Broadway <NA>
## 210 W Oakdale Ave & N Broadway <NA>
## 211 W Oakdale Ave & N Broadway <NA>
## 212 W Oakdale Ave & N Broadway <NA>
## 213 W Oakdale Ave & N Broadway <NA>
## 214 W Oakdale Ave & N Broadway <NA>
## 215 W Oakdale Ave & N Broadway <NA>
## 216 W Oakdale Ave & N Broadway <NA>
## 217 W Oakdale Ave & N Broadway <NA>
## 218 W Oakdale Ave & N Broadway <NA>
## 219 W Oakdale Ave & N Broadway <NA>
## 220 W Oakdale Ave & N Broadway <NA>
## 221 W Oakdale Ave & N Broadway <NA>
## 222 W Oakdale Ave & N Broadway <NA>
## 223 W Oakdale Ave & N Broadway <NA>
## 224 W Oakdale Ave & N Broadway <NA>
## 225 W Oakdale Ave & N Broadway <NA>
## 226 W Oakdale Ave & N Broadway <NA>
## 227 W Oakdale Ave & N Broadway <NA>
## 228 W Oakdale Ave & N Broadway <NA>
## 229 W Oakdale Ave & N Broadway <NA>
## 230 W Oakdale Ave & N Broadway <NA>
## 231 W Oakdale Ave & N Broadway <NA>
## 232 W Oakdale Ave & N Broadway <NA>
## 233 W Oakdale Ave & N Broadway <NA>
There are only 233 rows where the end_station_id is NA or empty and the station_name is not empty. I tend to remove all of the values where the start_station_name and end_station_name is empty as I will use these two columns to analyze the top stations.
unique(end_id_null$customer_type)
## [1] "casual" "member"
end_lat_null <- trip_data %>%
filter(is.na(end_lat))
unique(end_lat_null$bike_type)
## [1] "docked_bike" "classic_bike"
As shown above, I apply filters to explore the dataset and examine the null rows and corresponding columns. I notice that there is only electric bike with missing start_station_id. For the missing end_station_id, end_lat, and end_lng, it includes docked bike and electric bike. As I couldn’t access the stakeholders, it is impossible to confirm with them what causes the missing values in these columns. As I am going to use the geographical points to map stations trends, I will remove those rows with missing values in end_lat and end_lng.
bike_rides <- trip_data %>%
filter_at(vars(end_lat, end_lng), all_vars(!is.na(.))) %>%
filter(!start_station_name == "" & !end_station_name =="")
skim(bike_rides)
| Name | bike_rides |
| Number of rows | 4354488 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 4354280 | 0 |
| bike_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 3732831 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 3721211 | 0 |
| start_station_name | 0 | 1 | 3 | 53 | 0 | 781 | 0 |
| start_station_id | 259 | 1 | 1 | 36 | 0 | 1296 | 0 |
| end_station_name | 0 | 1 | 10 | 53 | 0 | 776 | 0 |
| end_station_id | 194 | 1 | 1 | 36 | 0 | 1294 | 0 |
| customer_type | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.64 | 0.02 | -87.83 | -87.66 | -87.64 | -87.63 | -87.53 | ▁▁▅▇▁ |
| end_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.17 | ▁▁▇▁▁ |
| end_lng | 0 | 1 | -87.64 | 0.02 | -87.83 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▆▇▁ |
summary(bike_rides)
## ride_id bike_type started_at ended_at
## Length:4354488 Length:4354488 Length:4354488 Length:4354488
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:4354488 Length:4354488 Length:4354488 Length:4354488
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.65 Min. :-87.83 Min. :41.65 Min. :-87.83
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.64 Mean :41.90 Mean :-87.64
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53 Max. :42.17 Max. :-87.52
## customer_type
## Length:4354488
## Class :character
## Mode :character
##
##
##
Right now, there is no missing values in the columns of start_station_name, end_station_name, end_lat and end_lng. I will not remove the null values in the start_station_id and end_station_id as I would not use these two columns in the following analysis.
Next, I will add some calculation columns to the bike_rides data set, like year, month, day, hour and ride duration to build some granularity for my analysis.
First, I will format the date type of started_at and ended_at into POSIXct type.
bike_rides$started_at <- as.POSIXct(bike_rides$started_at)
bike_rides$ended_at <- as.POSIXct(bike_rides$ended_at)
head(bike_rides)
## ride_id bike_type started_at ended_at
## 1 ACB6B40CF5B9044C electric_bike 2020-10-31 19:39:43 2020-10-31 19:57:12
## 2 DF450C72FD109C01 electric_bike 2020-10-31 23:50:08 2020-11-01 00:04:16
## 3 B6396B54A15AC0DF electric_bike 2020-10-31 23:00:01 2020-10-31 23:08:22
## 4 44A4AEE261B9E854 electric_bike 2020-10-31 22:16:43 2020-10-31 22:19:35
## 5 10B7DD76A6A2EB95 electric_bike 2020-10-31 19:38:19 2020-10-31 19:54:32
## 6 DA6C3759660133DA electric_bike 2020-10-29 17:38:04 2020-10-29 17:45:43
## start_station_name start_station_id end_station_name
## 1 Lakeview Ave & Fullerton Pkwy 313 Rush St & Hubbard St
## 2 Southport Ave & Waveland Ave 227 Kedzie Ave & Milwaukee Ave
## 3 Stony Island Ave & 67th St 102 University Ave & 57th St
## 4 Clark St & Grace St 165 Broadway & Sheridan Rd
## 5 Southport Ave & Wrightwood Ave 190 Stave St & Armitage Ave
## 6 Larrabee St & Division St 359 Wells St & Huron St
## end_station_id start_lat start_lng end_lat end_lng customer_type
## 1 125 41.92610 -87.63898 41.89035 -87.62607 casual
## 2 260 41.94817 -87.66391 41.92953 -87.70782 casual
## 3 423 41.77346 -87.58537 41.79145 -87.60005 casual
## 4 256 41.95085 -87.65924 41.95281 -87.65010 casual
## 5 185 41.92886 -87.66396 41.91778 -87.69143 casual
## 6 53 41.90353 -87.64335 41.89440 -87.63431 casual
started_at and ended_at have been tranfered into POSIXct type.
str(bike_rides)
## 'data.frame': 4354488 obs. of 13 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct, format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
Create columns for year, month, day, hour; ensure case of “Y/y, M/m, D/d, H/h, M/m, S/s” is correct.
bike_rides$started_year <- format(bike_rides$started_at, "%Y")
str(bike_rides)
## 'data.frame': 4354488 obs. of 14 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct, format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
## $ started_year : chr "2020" "2020" "2020" "2020" ...
bike_rides$started_month <- format(bike_rides$started_at, "%h")
str(bike_rides)
## 'data.frame': 4354488 obs. of 15 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct, format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
## $ started_year : chr "2020" "2020" "2020" "2020" ...
## $ started_month : chr "Oct" "Oct" "Oct" "Oct" ...
bike_rides$day_of_week <- format(bike_rides$started_at, "%a")
str(bike_rides)
## 'data.frame': 4354488 obs. of 16 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct, format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
## $ started_year : chr "2020" "2020" "2020" "2020" ...
## $ started_month : chr "Oct" "Oct" "Oct" "Oct" ...
## $ day_of_week : chr "Sat" "Sat" "Sat" "Sat" ...
Reset day of week levels.
bike_rides$day_of_week <- factor(bike_rides$day_of_week, levels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
bike_rides$started_hour <- format(bike_rides$started_at, "%H")
str(bike_rides)
## 'data.frame': 4354488 obs. of 17 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct, format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
## $ started_year : chr "2020" "2020" "2020" "2020" ...
## $ started_month : chr "Oct" "Oct" "Oct" "Oct" ...
## $ day_of_week : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 7 7 7 7 5 5 5 5 4 ...
## $ started_hour : chr "19" "23" "23" "22" ...
bike_rides$ride_duration <- round(difftime(bike_rides$ended_at, bike_rides$started_at, units = "mins"), 2)
str(bike_rides)
## 'data.frame': 4354488 obs. of 18 variables:
## $ ride_id : chr "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ bike_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct, format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct, format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : chr "313" "227" "102" "165" ...
## $ end_station_name : chr "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "125" "260" "423" "256" ...
## $ start_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
## $ started_year : chr "2020" "2020" "2020" "2020" ...
## $ started_month : chr "Oct" "Oct" "Oct" "Oct" ...
## $ day_of_week : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 7 7 7 7 5 5 5 5 4 ...
## $ started_hour : chr "19" "23" "23" "22" ...
## $ ride_duration : 'difftime' num 17.48 14.13 8.35 2.87 ...
## ..- attr(*, "units")= chr "mins"
class(bike_rides$ride_duration)
## [1] "difftime"
is.numeric(bike_rides$ride_duration)
## [1] FALSE
I would like convert the ride duration into numeric type, as I will use it to analyze the difference between members and casual users.
bike_rides$ride_duration <- as.numeric(bike_rides$ride_duration)
is.numeric(bike_rides$ride_duration)
## [1] TRUE
skim(bike_rides$ride_duration)
| Name | bike_rides$ride_duration |
| Number of rows | 4354488 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 20.57 | 308.8 | -29049.97 | 7.2 | 12.7 | 22.98 | 55944.15 | ▁▇▁▁▁ |
As we can see that, there is some negative values for ride duration. It means that ended date is earlier than started date. It must be errors of the data set. We have to delete these errors to ensure the following analysis correct. I use filter to delete these errors.
bike_rides_2 <- bike_rides %>%
filter(ride_duration >= 0)
skim(bike_rides_2$ride_duration)
| Name | bike_rides_2$ride_duratio… |
| Number of rows | 4351282 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 22.65 | 190.07 | 0 | 7.22 | 12.7 | 22.98 | 55944.15 | ▇▁▁▁▁ |
There is no neagtive values in ride_duration.
#table(bike_rides_2$ride_duration)
From the frequencies of less than 1 minutes, it is fairly represented. As there are hundreds of data points for rides less than 1 minutes, it means that it is not random errors. So I will keep these data points which less than 1 minutes.
There are still data points which are larger than 1440 minutes (1 day). The maximum rides is 58720.03 minutes, it means 40 days. It seems that it is impossible. I will treat these outliers as errors and should be deleted. I will delete the values which is tested as outliers.
#install.packages("EnvStats")
library(EnvStats)
##
## Attaching package: 'EnvStats'
## The following objects are masked from 'package:stats':
##
## predict, predict.lm
outliers <- rosnerTest(bike_rides_2$ride_duration, k = 10)
outliers
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: bike_rides_2$ride_duration
##
## Sample Size: 4351282
##
## Test Statistics: R.1 = 294.2206
## R.2 = 295.8499
## R.3 = 289.3681
## R.4 = 285.5782
## R.5 = 268.6228
## R.6 = 263.5360
## R.7 = 261.4041
## R.8 = 233.3958
## R.9 = 234.7783
## R.10 = 231.1427
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 10
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 22.65300 190.0666 55944.15 2261412 294.2206 5.707102 TRUE
## 2 1 22.64014 188.1665 55691.68 2214891 295.8499 5.707102 TRUE
## 3 2 22.62735 186.2644 53921.60 1345521 289.3681 5.707102 TRUE
## 4 3 22.61496 184.4635 52701.38 1824058 285.5782 5.707102 TRUE
## 5 4 22.60286 182.7267 49107.15 2914851 268.6228 5.707102 TRUE
## 6 5 22.59158 181.2053 47776.70 1014997 263.5360 5.707102 TRUE
## 7 6 22.58060 179.7533 47010.85 2216083 261.4041 5.707102 TRUE
## 8 7 22.56980 178.3364 41645.52 2377354 233.3958 5.707102 TRUE
## 9 8 22.56024 177.2166 41629.17 3206609 234.7783 5.707102 TRUE
## 10 9 22.55068 176.0905 40724.60 2677751 231.1427 5.707102 TRUE
There are many outliers presented above. I will use IQR method to remove outliers.
bike_rides_2 <- arrange(bike_rides_2, desc(ride_duration))
In the column of ride. duration, there are many values which are over 1440 minutes.
quartiles <- quantile(bike_rides_2$ride_duration, probs = c(0.25, 0.75), na.rm = F)
IQR <- IQR(bike_rides_2$ride_duration)
quartiles
## 25% 75%
## 7.22 22.98
IQR
## [1] 15.76
lower <- quartiles[1] -1.5*IQR
upper <- quartiles[2] + 1.5*IQR
lower
## 25%
## -16.42
upper
## 75%
## 46.62
bike_rides_rm_outlier <- subset(bike_rides_2, ride_duration > lower & ride_duration < upper)
outliers <- rosnerTest(bike_rides_rm_outlier$ride_duration, k = 3)
outliers
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: bike_rides_rm_outlier$ride_duration
##
## Sample Size: 4022037
##
## Test Statistics: R.1 = 3.203800
## R.2 = 3.203805
## R.3 = 3.203809
##
## Test Statistic Parameter: k = 3
##
## Alternative Hypothesis: Up to 3 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 14.46386 10.03687 46.62 1 3.203800 5.693689 FALSE
## 2 1 14.46385 10.03686 46.62 2 3.203805 5.693689 FALSE
## 3 2 14.46384 10.03685 46.62 3 3.203809 5.693689 FALSE
The Rosner’s test result show that there are no significant outliers in our trimmed data set. However, I would like to take a look the data which has been deleted to check if it is reasonable to remove them. So I create several datasets to compare the proportion of members and casuals.
bike_less_47 <- bike_rides_rm_outlier
bike_test <- bike_rides_2 %>%
mutate(duration_bins
= case_when(ride_duration >=0 & ride_duration<=60 ~ "0_1_h",
ride_duration >60 & ride_duration<=120 ~ "1_2_h",
ride_duration >120 & ride_duration<=180 ~ "2_3_h",
ride_duration >180 & ride_duration<=240 ~ "3_4_h",
ride_duration >240 & ride_duration<=300 ~ "4_5_h",
ride_duration >300 & ride_duration<=360 ~ "5_6_h",
ride_duration >360 & ride_duration<=420 ~ "6_7_h",
ride_duration >420 & ride_duration<=480 ~ "7_8_h",
ride_duration >480 ~ "8_h_plus"))
str(bike_test)
## 'data.frame': 4351282 obs. of 19 variables:
## $ ride_id : chr "F043F0F6A1AA4F85" "7F0578ABF030FC83" "BDA1217EC8532C7B" "E6E1E5BFFD72ECEC" ...
## $ bike_type : chr "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : POSIXct, format: "2021-06-05 02:27:26" "2021-06-04 22:03:33" ...
## $ ended_at : POSIXct, format: "2021-07-13 22:51:35" "2021-07-13 14:15:14" ...
## $ start_station_name: chr "Michigan Ave & Lake St" "Streeter Dr & Grand Ave" "State St & Van Buren St" "Clark St & Lake St" ...
## $ start_station_id : chr "TA1305000011" "13022" "TA1305000035" "KA1503000012" ...
## $ end_station_name : chr "Malcolm X College Vaccination Site" "Base - 2132 W Hubbard Warehouse" "Wells St & Polk St" "Base - 2132 W Hubbard Warehouse" ...
## $ end_station_id : chr "631" "Hubbard Bike-checking (LBS-WH-TEST)" "SL-011" "Hubbard Bike-checking (LBS-WH-TEST)" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
## $ started_year : chr "2021" "2021" "2021" "2021" ...
## $ started_month : chr "Jun" "Jun" "May" "Jun" ...
## $ day_of_week : Factor w/ 7 levels "Sun","Mon","Tue",..: 7 6 1 7 5 6 7 5 1 7 ...
## $ started_hour : chr "02" "22" "02" "23" ...
## $ ride_duration : num 55944 55692 53922 52701 49107 ...
## $ duration_bins : chr "8_h_plus" "8_h_plus" "8_h_plus" "8_h_plus" ...
bike_over_47 <- bike_rides_2 %>%
filter(ride_duration > 46.62)
bike_47_12h <- bike_rides_2 %>%
filter(ride_duration > 46.62 & ride_duration <= 720)
nrow(bike_47_12h)
## [1] 325280
There are 325280 records where ride duration is between 46.62 minutes and 720 minutes (12 hour)
bike_47_8h <- bike_rides_2 %>%
filter(ride_duration > 46.62 & ride_duration <= 480)
nrow(bike_47_8h)
## [1] 323723
bike_over_12h <- bike_over_47 %>%
filter(ride_duration > 720)
nrow(bike_over_12h)
## [1] 3965
There are 3965 records where ride duration are over 12 hour.
bike_over_day <- bike_over_47 %>%
filter(ride_duration > 1440)
nrow(bike_over_day)
## [1] 1389
There are only 1389 records where ride duration are over a day.
bike_less_47 %>%
group_by(customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x = "", y=count, fill = customer_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
theme(axis.title = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 1)) +
ggtitle("Rides Per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size =20)) +
geom_text(aes(label = paste(customer_type, " ", round(count / sum(count) * 100, 1), "%")), position = position_stack(vjust = 0.5), size=5) +
theme(legend.position = "none")
#### Checking the outliers distribution
bike_over_47 %>%
group_by(customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x = "", y=count, fill = customer_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
theme(axis.title = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 1)) +
ggtitle("Rides Per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = paste(round(count / sum(count) * 100, 1), "%")), position = position_stack(vjust = 0.5), size=5)
bike_47_12h %>%
group_by(customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x = "", y = count, fill = customer_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 1, direction = -1) +
theme_void() +
theme(axis.title = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 1)) +
ggtitle("Rides per Customer Type (Over 47 and less 12h)") +
theme(plot.title = element_text(hjust = 0.4, vjust = 0.5, size =20, face = "bold")) +
geom_text(aes(label = paste(round(count / sum(count)*100, 1), "%")), position = position_stack(vjust = 0.5), size = 5)
bike_over_day %>%
group_by(customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x="", y = count, fill = customer_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
theme(axis.title = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust =1)) +
ggtitle("Rides per Customer Type (Over a Day)") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = paste(round(count / sum(count)*100, 1), "%")), position = position_stack(vjust = 0.5), size =5)
bike_test %>%
filter(!duration_bins == "0_1_h" & !duration_bins == "1_2_h") %>%
group_by(customer_type) %>%
ggplot(aes(x = duration_bins, fill = customer_type)) +
geom_bar()
bike_test_5h <- bike_test %>%
filter(duration_bins =="4_5_h")
As we can see that, from 4 hour, the rides number drop sharply. Between 240 and 300 minutes, there are only 2785 rides. So, I think it is reasonable to delete outlier which are longer than 300 minutes, rather than IQR. With IQR, it will remove over 300,000 rides and over 90% of the outliers are casuals. If I remove these values which are over 46.62 minutes, I may lose the chance to discover the casual customer’s behavior.
With this, we can then proceed with the succeeding statistical analysis using our trimmed data.
bike_rides_less_5h <- bike_rides_2 %>%
filter(ride_duration <= 300)
rosnerTest(bike_rides_less_5h$ride_duration, k = 10)
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: bike_rides_less_5h$ride_duration
##
## Sample Size: 4343050
##
## Test Statistics: R.1 = 12.10182
## R.2 = 12.10159
## R.3 = 12.10007
## R.4 = 12.09898
## R.5 = 12.09832
## R.6 = 12.09810
## R.7 = 12.09658
## R.8 = 12.09635
## R.9 = 12.09569
## R.10 = 12.09503
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 10
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 19.55842 23.16772 299.93 1 12.10182 5.70678 TRUE
## 2 1 19.55835 23.16734 299.92 2 12.10159 5.70678 TRUE
## 3 2 19.55829 23.16695 299.88 3 12.10007 5.70678 TRUE
## 4 3 19.55822 23.16656 299.85 4 12.09898 5.70678 TRUE
## 5 4 19.55816 23.16617 299.83 5 12.09832 5.70678 TRUE
## 6 5 19.55809 23.16579 299.82 6 12.09810 5.70678 TRUE
## 7 6 19.55803 23.16540 299.78 7 12.09658 5.70678 TRUE
## 8 7 19.55797 23.16501 299.77 8 12.09635 5.70678 TRUE
## 9 8 19.55790 23.16462 299.75 9 12.09569 5.70678 TRUE
## 10 9 19.55784 23.16423 299.73 10 12.09503 5.70678 TRUE
Even the outlier check shows that there are some outliers, I won’t remove them based on the previous analysis.
skim(bike_rides_less_5h$ride_duration)
| Name | bike_rides_less_5h$ride_d… |
| Number of rows | 4343050 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 19.56 | 23.17 | 0 | 7.2 | 12.68 | 22.9 | 299.93 | ▇▁▁▁▁ |
bike_rides_less_5h$ride_id[duplicated(bike_rides_less_5h$ride_id)]
## character(0)
There is no duplicates in ride_id.
Now, I have finished the data cleaning and transforming. I will start to analyze the data in the following part.
bike_cleaned <- bike_rides_less_5h
skim(bike_cleaned)
| Name | bike_cleaned |
| Number of rows | 4343050 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| factor | 1 |
| numeric | 5 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 4343050 | 0 |
| bike_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 0 | 1 | 3 | 53 | 0 | 781 | 0 |
| start_station_id | 259 | 1 | 1 | 36 | 0 | 1296 | 0 |
| end_station_name | 0 | 1 | 10 | 53 | 0 | 776 | 0 |
| end_station_id | 194 | 1 | 1 | 36 | 0 | 1294 | 0 |
| customer_type | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
| started_year | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| started_month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
| started_hour | 0 | 1 | 2 | 2 | 0 | 24 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| day_of_week | 0 | 1 | FALSE | 7 | Sat: 789472, Sun: 675808, Fri: 623443, Thu: 583858 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.64 | 0.02 | -87.83 | -87.66 | -87.64 | -87.63 | -87.53 | ▁▁▅▇▁ |
| end_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.17 | ▁▁▇▁▁ |
| end_lng | 0 | 1 | -87.64 | 0.02 | -87.83 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▆▇▁ |
| ride_duration | 0 | 1 | 19.56 | 23.17 | 0.00 | 7.20 | 12.68 | 22.90 | 299.93 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-10-01 00:00:06 | 2021-09-30 23:59:44 | 2021-06-19 19:51:09 | 3724611 |
| ended_at | 0 | 1 | 2020-10-01 00:05:09 | 2021-10-01 01:23:49 | 2021-06-19 20:16:56 | 3712429 |
bike_cleaned_drd <- arrange(bike_cleaned, desc(ride_duration))
summary(bike_cleaned$ride_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.20 12.68 19.56 22.90 299.93
bike_cleaned_2 <- bike_cleaned %>%
mutate(duration_bins
= case_when(ride_duration >=0 & ride_duration<=5 ~ "0_5_mins",
ride_duration >5 & ride_duration<=10 ~ "5_10_mins",
ride_duration >10 & ride_duration<=15 ~ "10_15_mins",
ride_duration >15 & ride_duration<=20 ~ "15_20_mins",
ride_duration >20 & ride_duration<=25 ~ "20_25_mins",
ride_duration >25 & ride_duration<=30 ~ "25_30_mins",
ride_duration >30 & ride_duration<=35 ~ "30_35_mins",
ride_duration >35 & ride_duration<=40 ~ "35_40_mins",
ride_duration >40 & ride_duration<=45 ~ "40_45_mins",
ride_duration >45 & ride_duration<=50 ~ "45_50_mins",
ride_duration >50 & ride_duration<=60 ~ "50_60_mins",
ride_duration >60 & ride_duration<=90 ~ "60_90_mins",
ride_duration >90 & ride_duration<=150 ~ "90_150_mins",
ride_duration >150 ~ "over_150_mins"))
skim(bike_cleaned_2)
| Name | bike_cleaned_2 |
| Number of rows | 4343050 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 11 |
| factor | 1 |
| numeric | 5 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 4343050 | 0 |
| bike_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 0 | 1 | 3 | 53 | 0 | 781 | 0 |
| start_station_id | 259 | 1 | 1 | 36 | 0 | 1296 | 0 |
| end_station_name | 0 | 1 | 10 | 53 | 0 | 776 | 0 |
| end_station_id | 194 | 1 | 1 | 36 | 0 | 1294 | 0 |
| customer_type | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
| started_year | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| started_month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
| started_hour | 0 | 1 | 2 | 2 | 0 | 24 | 0 |
| duration_bins | 0 | 1 | 8 | 13 | 0 | 14 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| day_of_week | 0 | 1 | FALSE | 7 | Sat: 789472, Sun: 675808, Fri: 623443, Thu: 583858 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.06 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.64 | 0.02 | -87.83 | -87.66 | -87.64 | -87.63 | -87.53 | ▁▁▅▇▁ |
| end_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.90 | 41.93 | 42.17 | ▁▁▇▁▁ |
| end_lng | 0 | 1 | -87.64 | 0.02 | -87.83 | -87.66 | -87.64 | -87.63 | -87.52 | ▁▁▆▇▁ |
| ride_duration | 0 | 1 | 19.56 | 23.17 | 0.00 | 7.20 | 12.68 | 22.90 | 299.93 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-10-01 00:00:06 | 2021-09-30 23:59:44 | 2021-06-19 19:51:09 | 3724611 |
| ended_at | 0 | 1 | 2020-10-01 00:05:09 | 2021-10-01 01:23:49 | 2021-06-19 20:16:56 | 3712429 |
colnames(bike_cleaned_2)
## [1] "ride_id" "bike_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "customer_type" "started_year" "started_month"
## [16] "day_of_week" "started_hour" "ride_duration"
## [19] "duration_bins"
head(bike_cleaned_2)
## ride_id bike_type started_at ended_at
## 1 E5CF6494F2CF5635 classic_bike 2021-05-02 17:00:30 2021-05-02 22:00:26
## 2 8C616C19B29AC3F9 docked_bike 2021-09-06 13:33:38 2021-09-06 18:33:33
## 3 71CD7586C02F9D58 docked_bike 2021-09-04 10:41:12 2021-09-04 15:41:05
## 4 5C2A0EBF288CF749 classic_bike 2021-07-27 12:35:56 2021-07-27 17:35:47
## 5 BA59BFC7426B7D5E docked_bike 2021-07-23 12:15:39 2021-07-23 17:15:29
## 6 531BEF5239B4D05C classic_bike 2021-06-29 16:13:09 2021-06-29 21:12:58
## start_station_name start_station_id
## 1 Fort Dearborn Dr & 31st St TA1307000048
## 2 DuSable Lake Shore Dr & North Blvd LF-005
## 3 DuSable Lake Shore Dr & North Blvd LF-005
## 4 Clarendon Ave & Gordon Ter 13379
## 5 Streeter Dr & Grand Ave 13022
## 6 Rush St & Superior St 15530
## end_station_name end_station_id
## 1 McClurg Ct & Erie St KA1503000041
## 2 DuSable Lake Shore Dr & North Blvd LF-005
## 3 Orleans St & Merchandise Mart Plaza TA1305000022
## 4 Clarendon Ave & Gordon Ter 13379
## 5 State St & Randolph St TA1305000029
## 6 Base - 2132 W Hubbard Warehouse Hubbard Bike-checking (LBS-WH-TEST)
## start_lat start_lng end_lat end_lng customer_type started_year
## 1 41.83856 -87.60822 41.89450 -87.61785 casual 2021
## 2 41.91172 -87.62680 41.91172 -87.62680 casual 2021
## 3 41.91172 -87.62680 41.88824 -87.63639 casual 2021
## 4 41.95787 -87.64951 41.95787 -87.64951 member 2021
## 5 41.89228 -87.61204 41.88468 -87.62798 casual 2021
## 6 41.89576 -87.62591 41.88995 -87.68065 member 2021
## started_month day_of_week started_hour ride_duration duration_bins
## 1 May Sun 17 299.93 over_150_mins
## 2 Sep Mon 13 299.92 over_150_mins
## 3 Sep Sat 10 299.88 over_150_mins
## 4 Jul Tue 12 299.85 over_150_mins
## 5 Jul Fri 12 299.83 over_150_mins
## 6 Jun Tue 16 299.82 over_150_mins
tail(bike_cleaned_2)
## ride_id bike_type started_at ended_at
## 4343045 7E8BDEE052DDB6B9 classic_bike 2021-09-28 17:39:07 2021-09-28 17:39:07
## 4343046 1C00E7445599C55A classic_bike 2021-09-18 20:26:23 2021-09-18 20:26:23
## 4343047 AB0227900FBBA279 classic_bike 2021-09-04 15:53:04 2021-09-04 15:53:04
## 4343048 84D8C8E8FDCBBFB0 electric_bike 2021-09-12 17:29:00 2021-09-12 17:29:00
## 4343049 0C32DAA0DDCAA66B electric_bike 2021-09-05 15:16:10 2021-09-05 15:16:10
## 4343050 8590ABC3CD2BBC35 classic_bike 2021-09-22 18:51:46 2021-09-22 18:51:46
## start_station_name start_station_id
## 4343045 Halsted St & Clybourn Ave 331
## 4343046 Racine Ave & Fullerton Ave TA1306000026
## 4343047 DuSable Lake Shore Dr & North Blvd LF-005
## 4343048 Wabash Ave & 9th St TA1309000010
## 4343049 Sheffield Ave & Wrightwood Ave TA1309000023
## 4343050 Sedgwick St & Webster Ave 13191
## end_station_name end_station_id start_lat start_lng
## 4343045 Halsted St & Clybourn Ave 331 41.90967 -87.64813
## 4343046 Racine Ave & Fullerton Ave TA1306000026 41.92556 -87.65840
## 4343047 DuSable Lake Shore Dr & North Blvd LF-005 41.91172 -87.62680
## 4343048 Wabash Ave & 9th St TA1309000010 41.87077 -87.62580
## 4343049 Sheffield Ave & Wrightwood Ave TA1309000023 41.92864 -87.65378
## 4343050 Sedgwick St & Webster Ave 13191 41.92217 -87.63889
## end_lat end_lng customer_type started_year started_month day_of_week
## 4343045 41.90967 -87.64813 casual 2021 Sep Tue
## 4343046 41.92556 -87.65840 member 2021 Sep Sat
## 4343047 41.91172 -87.62680 member 2021 Sep Sat
## 4343048 41.87083 -87.62583 member 2021 Sep Sun
## 4343049 41.92865 -87.65378 casual 2021 Sep Sun
## 4343050 41.92217 -87.63889 member 2021 Sep Wed
## started_hour ride_duration duration_bins
## 4343045 17 0 0_5_mins
## 4343046 20 0 0_5_mins
## 4343047 15 0 0_5_mins
## 4343048 17 0 0_5_mins
## 4343049 15 0 0_5_mins
## 4343050 18 0 0_5_mins
We generated a column to categorize the year, the month, days of the week, and the hour. The ride duration, measured in minutes, was subsequently calculated.
To identify outlier data, we utilized the Rosner Test.
The upper and lower limits of data inclusion were computed using the following formulas:
When I attempted to exclude outliers using the Interquartile Range (IQR) method, I observed an imbalance in the removed values, with a disproportionate number of casual values being affected. Upon examining the data for ride durations longer than 46.62 minutes (the threshold for the upper limit), I discovered over 300,000 rows, with 90% of them being casual rides. Therefore, I made the decision to delete only the rides lasting more than 300 minutes. This approach seemed reasonable, as it is not uncommon for people to rent bikes for periods of less than 5 hours.
The rides were classified into bins in increments of minutes for ease of analysis.
# options(max.print=1000000000)
H1: Members use bikes more often than casuals.
bike_cleaned_2 %>%
group_by(customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x = "", y=count, fill = customer_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
theme(axis.title = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 1)) +
ggtitle("Rides Percent per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = paste(customer_type, " ", round(count / sum(count) * 100, 1), "%")), position = position_stack(vjust = 0.5), size=5) +
theme(legend.position = "none")
Member riders account for 54.9% while casual riders account for
45.1%.
ggplot(bike_cleaned_2, aes(x = customer_type, fill = customer_type)) +
geom_bar(position = "dodge", width = 0.6) +
scale_y_continuous(name = "Number of Rides", labels = function(x) format(x, big.mard = ",", scientific = F)) +
theme_classic() +
labs(x = "Customer Type") +
coord_cartesian(ylim = c(100000, 2500000)) +
ggtitle("Total Rides per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size =20)) +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5, color = "black", size = 5)
Member riders took more than 2 million trips while casual riders took
more than 1.95 million rides.
I use chi-square test to examine the difference of these two groups frequencies.
chisq.test(table(bike_cleaned_2$customer_type))
##
## Chi-squared test for given probabilities
##
## data: table(bike_cleaned_2$customer_type)
## X-squared = 42487, df = 1, p-value < 2.2e-16
Chi-square test result show that p-value is less than 0.05, means that there is a significant difference between members and casuals rides. There was evidence that at the 5% level, the hypothesis being tested was correct, it is that members use bikes more often than casuals. The hypothesis H1 was supported.
bike_cleaned_2 %>%
group_by(bike_type) %>%
summarise(bike_rides = n()) %>%
ggplot(aes(x = "", y = bike_rides, fill = bike_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
ggtitle("Rides Percent Per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = paste(round(bike_rides / sum(bike_rides)*100, 1), "%")), position = position_stack(vjust = 0.5), size = 5)
Based on the data I selected, classic bikes account for 63.1% of the total rides, electric bikes account for 21.5%, and docked bikes account for 45.1%. Since we lack sufficient information, it’s challenging to determine definitively whether this distribution reflects customer preference or the company’s strategy in launching more classic bikes than the other two options.
ggplot(bike_cleaned_2, aes(x = bike_type, fill = bike_type)) +
geom_bar(position = "dodge", width = 0.6) +
scale_y_continuous(name = "Number of Rides", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(100000, 2800000)) +
theme_classic() +
labs(x = "Bike Type") +
ggtitle("Total Rides per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = ..count..), stat = "count", vjust = -0.5, color = "black", size = 5)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Based on the data I selected, classic bikes reach rides to 2,740,198, docked bikes hit 669,131 rides, and electric bikes hit 933,730 rides.
bike_cleaned_2 %>%
group_by(customer_type) %>%
summarise(customer_sum = sum(ride_duration)) %>%
ggplot(aes(x = "", y = customer_sum, fill = customer_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
ggtitle("Ride Duration Percent per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, size =20, face = "bold")) +
geom_text(aes(label = paste(customer_type, " ", round(customer_sum / sum(customer_sum)*100, 1), "%")), position = position_stack(vjust = 0.5),size = 5)
The proportion of ride duration for casual users has dramatically
increased to 62.2% compared to the proportion of rides (45.1%). This
suggests that casual users tend to use bikes for longer duration per
ride than members.
bike_cleaned_2 %>%
group_by(customer_type) %>%
summarise(customer_duration = round(sum(ride_duration), 1)) %>%
ggplot(aes(x = customer_type, y = customer_duration, fill = customer_type, label = customer_duration)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
scale_y_continuous(name = "Ride Duration per Customer Type", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(5000000, 60000000)) +
theme_classic() +
labs(x = "Customer Type") +
ggtitle("Total Ride Duration per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = customer_duration), vjust = -0.5, color = "black", size = 5) +
theme(legend.position = "none")
H2: Members’ average ride duration is higher than casuals’.
aggregate(bike_cleaned_2, ride_duration~customer_type, FUN = mean)
## customer_type ride_duration
## 1 casual 27.02189
## 2 member 13.43846
bike_cleaned_2 %>%
group_by(customer_type) %>%
summarise(mean_duration_c = round(mean(ride_duration), 1)) %>%
ggplot(aes(x = customer_type, y = mean_duration_c, fill = customer_type)) +
geom_col(width = 0.6, position = "dodge") +
theme_classic() +
scale_y_continuous(name = "Mean of Ride Duration") +
coord_cartesian(ylim = c(1, 30)) +
ggtitle("Ride Duration Mean per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20), legend.position = "none") +
labs(x = "Customer Type") +
geom_text(aes(label = mean_duration_c), vjust = -1, color = "black", size = 5)
t.test(ride_duration~customer_type, data = bike_cleaned_2)
##
## Welch Two Sample t-test
##
## data: ride_duration by customer_type
## t = 592.12, df = 2488536, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group casual and group member is not equal to 0
## 95 percent confidence interval:
## 13.53847 13.62839
## sample estimates:
## mean in group casual mean in group member
## 27.02189 13.43846
Based on the very low p-value and the confidence interval that does not include 0, the t-test result indicates a highly significant difference in ride duration between the “casual” and “member” customer types. H2: Members’ average ride duration is higher than casuals, was not supported. Casual users’ average ride duration is longer than members.
aggregate(bike_cleaned_2, ride_duration~customer_type, FUN = median)
## customer_type ride_duration
## 1 casual 17.13
## 2 member 10.10
bike_cleaned_2 %>%
group_by(customer_type) %>%
summarise(median_duration_c = median(ride_duration)) %>%
ggplot(aes(x = customer_type, y = median_duration_c, fill = customer_type)) +
theme_classic() +
geom_col(width = 0.6, position = "dodge") +
scale_y_continuous(name = "Median of Ride Duration") +
coord_cartesian(ylim = c(1, 20)) +
labs(x = "Customer Type") +
ggtitle("Ride Duration Median per Customer Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20), legend.position = "none") +
geom_text(aes(label = median_duration_c), vjust = -1, color = "black", size = 5)
wilcox.test(bike_cleaned_2$ride_duration~bike_cleaned_2$customer_type)
##
## Wilcoxon rank sum test with continuity correction
##
## data: bike_cleaned_2$ride_duration by bike_cleaned_2$customer_type
## W = 3.1832e+12, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank sum test result suggests strong evidence of a significant difference in the medians of ride duration between the two customer types (“casual” and “member”).
aggregate(bike_cleaned_2, ride_duration~customer_type, FUN = max)
## customer_type ride_duration
## 1 casual 299.93
## 2 member 299.85
bike_sum_duration <- bike_cleaned_2 %>%
group_by(bike_type) %>%
summarise(sum_duration = sum(ride_duration))
bike_sum_duration
## # A tibble: 3 × 2
## bike_type sum_duration
## <chr> <dbl>
## 1 classic_bike 50013721.
## 2 docked_bike 19583950.
## 3 electric_bike 15345511.
ggplot(bike_sum_duration, aes(x = "", y = sum_duration, fill = bike_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
theme(axis.title = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 1)) +
ggtitle("Ride Duration Percent per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = paste(round(sum_duration / sum(sum_duration)*100, 1), "%")), position = position_stack(vjust = 0.4), size = 5)
The classic bike account for 58.9% of the total ride duration, docked bike account for 23.1%, and electric bike account for 18.1%.
bike_cleaned_2 %>%
group_by(bike_type) %>%
summarise(bike_type_duration= round(sum(ride_duration), 1)) %>%
ggplot(aes(x = bike_type, y = bike_type_duration, fill = bike_type, label = bike_type_duration)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
scale_y_continuous(name = "Ride Duration per Bike Type", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(5000000, 55000000)) +
theme_classic() +
ggtitle("Total Ride Duration per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face= "bold", size = 20), legend.position = "none") +
geom_text(vjust = -0.5, color = "black", size = 5)
Based on the data, it’s evident that classic bikes have the highest
total ride duration, surpassing 50 million minutes. Docked bikes have a
ride duration of around 19 million minutes, while electric bikes have
achieved over 15 million minutes of ride duration. However, we can’t
definitively determine which bike type is the most popular, as the ride
duration is influenced by the number of bikes the company introduced in
the area. Utilizing the ride duration mean could potentially serve as a
more reliable indicator of bike popularity.
aggregate(bike_cleaned_2, ride_duration~bike_type, FUN = mean)
## bike_type ride_duration
## 1 classic_bike 18.25192
## 2 docked_bike 29.26774
## 3 electric_bike 16.43463
bike_cleaned_2 %>%
group_by(bike_type) %>%
summarise(mean_duration = round(mean(ride_duration), 1)) %>%
ggplot(aes(x = bike_type, y = mean_duration, fill = bike_type)) +
geom_col(position = "dodge", width = 0.6, color = "white") +
theme_classic() +
coord_cartesian(ylim = c(1, 35)) +
scale_y_continuous(name = "Mean of Ride Duration") +
ggtitle("Ride Duration Mean per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
labs(x = "Bike Type") +
geom_text(aes(label = mean_duration), vjust = -1, color = "black", size = 5) +
theme(legend.position = "none")
#levene_test_result <- leveneTest(ride_duration ~ bike_type, data = bike_cleaned_2)
#levene_test_result
The result shows that the variances are not equal between bike types.
duration_aov <- aov(ride_duration~bike_type, bike_cleaned_2)
summary(duration_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## bike_type 2 7.687e+07 38434108 74048 <2e-16 ***
## Residuals 4343047 2.254e+09 519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value associated with the “bike_type” variable is remarkably low (p < 0.001), indicating a significant impact of bike type on ride duration. This finding suggests that docked bikes are the most popular choice. Further analysis using the TukeyHSD function reveals a significant difference between electric and classic bikes, leading to the conclusion that the classic bike stands as the second most favored option.
tukey_aov<-TukeyHSD(duration_aov)
tukey_aov
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ride_duration ~ bike_type, data = bike_cleaned_2)
##
## $bike_type
## diff lwr upr p adj
## docked_bike-classic_bike 11.015812 10.943002 11.088623 0
## electric_bike-classic_bike -1.817289 -1.881273 -1.753306 0
## electric_bike-docked_bike -12.833102 -12.918625 -12.747578 0
The result shows that there are significant difference between pairwise bike types.
aggregate(bike_cleaned_2, ride_duration~bike_type, FUN = median)
## bike_type ride_duration
## 1 classic_bike 12.42
## 2 docked_bike 17.95
## 3 electric_bike 11.12
The results reveal that docked bikes have the highest median ride duration, while electric bikes have the lowest median ride duration.
bike_cleaned_2 %>%
group_by(bike_type) %>%
summarise(median_duration_b = median(ride_duration)) %>%
ggplot(aes(x = bike_type, y = median_duration_b, fill = bike_type, label = median_duration_b)) +
geom_col(width = 0.6, position = "dodge") +
labs(y = "Median of Ride Duration", x = "Bike Type") +
coord_cartesian(ylim = c(1, 20)) +
theme_classic() +
ggtitle("Ride Duration Median per Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20), legend.position = "none") +
geom_text(vjust = -1, color = "black", size = 5)
H3: Member and casuals are significantly different in bike type choice.
bike_cleaned_2 %>%
group_by(bike_type, customer_type) %>%
summarise(total_rides = n()) %>%
ggplot(aes(fill = customer_type, x = bike_type, y = total_rides)) +
geom_col(width = 0.8, position = "stack") +
theme_classic() +
scale_y_continuous(name = "Total Rides", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(100000, 3000000)) +
ggtitle("Customer's Total Rides by Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
labs(x = "Bike Type") +
geom_text(aes(label = total_rides, group = customer_type), position = position_stack(vjust = 0.2, reverse = F), vjust = -0.2, size = 5)
## `summarise()` has grouped output by 'bike_type'. You can override using the
## `.groups` argument.
The results suggest that member users exhibit the highest ride frequency with classic bikes compared to all other segments, while also demonstrating the lowest ride frequency on docked bikes among all segments. In contrast, casual users show a higher ride frequency with classic bikes, although it remains lower than that of member users with classic bikes. Additionally, casual users have the lowest ride frequency with docked bikes, yet this frequency is higher than that of member users with docked bikes.
chisq.test(table(bike_cleaned_2$customer_type, bike_cleaned_2$bike_type))
##
## Pearson's Chi-squared test
##
## data: table(bike_cleaned_2$customer_type, bike_cleaned_2$bike_type)
## X-squared = 84189, df = 2, p-value < 2.2e-16
The results of the chi-square test indicate that the p-value is less than 0.05, signifying a significant distinction between bike types preferred by members and casual users. Specifically, member riders tend to favor classic and electric bikes more than casual users. Additionally, it’s noteworthy that casual users show a preference for docked bikes compared to members. This evidence supports the hypothesis H3, which suggests that there are significant differences in bike type choices between members and casuals.
bike_cleaned_2 %>%
group_by(bike_type, customer_type) %>%
summarise(ride_percent = round(n()/nrow(bike_cleaned_2), 2), "%") %>%
ggplot(aes(x = bike_type, y = ride_percent, fill = customer_type)) +
geom_col(width = 0.6, position = "stack") +
scale_y_continuous(name = "Ride Percent", labels = scales::percent) +
coord_cartesian(ylim = c(0.05, 0.8)) +
theme_classic() +
ggtitle("Customer's Rides Percent by Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = paste(ride_percent*100, "%"), group = customer_type), position = position_stack(vjust = 0, reverse = F), vjust = -1, color = "black", size =5)
## `summarise()` has grouped output by 'bike_type'. You can override using the
## `.groups` argument.
aggregate(bike_cleaned_2, ride_duration~bike_type + customer_type, FUN = mean)
## bike_type customer_type ride_duration
## 1 classic_bike casual 24.96029
## 2 docked_bike casual 39.61202
## 3 electric_bike casual 20.75367
## 4 classic_bike member 13.65749
## 5 docked_bike member 13.70753
## 6 electric_bike member 12.56964
bike_cleaned_2 %>%
group_by(bike_type, customer_type) %>%
summarise(mean_duration_b = round(mean(ride_duration), 1)) %>%
ggplot(aes(x = bike_type, y = mean_duration_b, fill = customer_type)) +
geom_col(width = 0.6, position = "stack") +
scale_y_continuous(name = "Ride duration mean") +
coord_cartesian(ylim = c(0, 40)) +
labs(x = "Bike Type") +
theme_classic() +
ggtitle("Customer's Ride Duration Mean by Bike Type") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face="bold", size = 20)) +
geom_text(aes(label = mean_duration_b), vjust = 2, size = 5)
## `summarise()` has grouped output by 'bike_type'. You can override using the
## `.groups` argument.
The findings reveal that members exhibit comparable average ride durations for classic and docked bikes, with slightly shorter durations for electric bikes. Furthermore, the mean ride durations for all bike types are consistently lower among members in comparison to casual users.
mult_test <- bike_cleaned_2 %>%
group_by(bike_type) %>%
group_modify(
~ t.test(ride_duration ~ customer_type, data = .x, var.equal = TRUE) %>%
broom::tidy()
)
mult_test
## # A tibble: 3 × 11
## # Groups: bike_type [3]
## bike_type estimate estimate1 estimate2 statistic p.value parameter conf.low
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 classic_bike 11.3 25.0 13.7 466. 0 2740187 11.3
## 2 docked_bike 25.9 39.6 13.7 323. 0 669129 25.7
## 3 electric_bi… 8.18 20.8 12.6 225. 0 933728 8.11
## # ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
I use another method to compare the mean difference between member and casual users.
bike_cleaned_2 %>%
group_by(bike_type) %>%
rstatix::t_test(ride_duration ~ customer_type, var.equal = F)
## # A tibble: 3 × 9
## bike_type .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 classic_bike ride_duration casual member 1.11e6 1.63e6 411. 1.42e6 0
## 2 docked_bike ride_duration casual member 4.02e5 2.67e5 382. 5.09e5 0
## 3 electric_bike ride_duration casual member 4.41e5 4.93e5 219. 6.89e5 0
The results of the t-test reveal a noteworthy disparity in means between members and casuals for each bike type. This observation suggests that, irrespective of the specific bike type, casual users generally exhibit longer ride duration compared to members. Given this compelling finding, Cyclistic could benefit from revisiting their service fee policy. Implementing a service fee structure based on ride duration might be a strategic approach to enhance revenue generation. Such a policy adjustment aligns with the observed usage patterns leading to increased revenue for the company.
As my data is from October of 2020 to September to 2021, so I arrange months levels from October as a start.
bike_cleaned_2$started_month <- factor(bike_cleaned_2$started_month, levels = c("Oct", "Nov", "Dec", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"))
bike_cleaned_2$started_y_month <- format(bike_cleaned_2$started_at, "%Y-%m")
H4: Due to the weather in Chicago, both members and casuals prefer to use bikes in summer than winter.
H5: Members’ total rides per month are higher than casuals.
bike_cleaned_2 %>%
group_by(started_y_month, customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x = started_y_month, y = count, color = customer_type, group = customer_type)) +
geom_line(linewidth =1) +
geom_point(size = 2) +
scale_y_continuous(name = "Total Rides", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(10000, 400000)) +
labs(x = "Month") +
theme_classic() +
ggtitle("Customer's Total Rides by Month") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20), axis.text.x = element_text(angle = -45, hjust =0))
## `summarise()` has grouped output by 'started_y_month'. You can override using
## the `.groups` argument.
chisq.test(table(bike_cleaned_2$started_y_month, bike_cleaned_2$customer_type))
##
## Pearson's Chi-squared test
##
## data: table(bike_cleaned_2$started_y_month, bike_cleaned_2$customer_type)
## X-squared = 130000, df = 11, p-value < 2.2e-16
Chi-square test result show that p-value is less than 0.05, means that there is a significant difference between members and casuals rides. There was evidence that at the 5% level, months is a factor which can impact total rides significantly.
The data suggests a preference for biking during the months of May through September, with lower activity observed in January and February. Casual riders demonstrated their peak usage in July, while members reached their highest point in August. It’s worth noting that member ridership consistently exceeded that of casual riders except in July and August. The fourth hypothesis, H4: due to the weather in Chicago, both members and casuals prefer to use bikes in summer rather than winter, finds support in the data. Consequently, launching marketing campaigns in spring to encourage bike usage would be advantageous for Cyclistic Company.
However, the fifth hypothesis, H5: Members’ total rides per month are higher than casuals, is not supported by the data. While the overall trend shows casual users surpassing members in July and August, the differences in May, June, and September are relatively small, suggests that Cyclistic Company should not solely focus on member users. Implementing campaigns to encourage both casual and member ridership could potentially yield greater benefits for the company.
H6: Members’ total ride duration per month is higher than casuals.
bike_cleaned_2 %>%
group_by(started_y_month, customer_type) %>%
summarise(total_duration_m = sum(ride_duration)) %>%
ggplot(aes(x = started_y_month, y = total_duration_m, color = customer_type, group = customer_type)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
scale_y_continuous(name = "Total ride duration", labels = function(x) format(x, bigmark = ",", scientific = F)) +
coord_cartesian(ylim = c(5000, 11000000)) +
labs(x = "Month") +
theme_classic()+
ggtitle("Customer's Total Ride Duration by Month") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20), axis.text.x = element_text(angle = -45, hjust = 0))
## `summarise()` has grouped output by 'started_y_month'. You can override using
## the `.groups` argument.
By the chart above, it’s evident that casual users have higher total ride duration throughout the year, with the exception of November, December, January, and February. This pattern suggests that, apart from the winter months, casual riders maintain longer ride duration compared to members during other seasons. The sixth hypothesis, which posited that members’ total ride duration per month is greater than that of casuals, did not find support in the data. This reinforces the need for Cyclistic to formulate strategies aimed at expanding the base of casual users.
bike_cleaned_2 %>%
group_by(started_y_month, customer_type) %>%
summarise(mean_duration_m = round(mean(ride_duration), 1)) %>%
ggplot(aes(x = started_y_month, y = mean_duration_m, fill = customer_type)) +
geom_col(width = 0.8) +
labs(x = "Month", y = "Mean of Ride Duration") +
coord_cartesian(ylim = c(0, 40)) +
theme_classic() +
ggtitle("Ride Duration Mean per Month") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size =20), axis.text.x = element_text(angle = -45)) +
geom_text(aes(label = mean_duration_m), vjust = 2, size = 4)
## `summarise()` has grouped output by 'started_y_month'. You can override using
## the `.groups` argument.
The chart above clearly illustrates that casual users consistently have a longer mean ride duration compared to members throughout the entire year.
bike_cleaned_2 %>%
group_by(started_y_month) %>%
rstatix::t_test(ride_duration ~ customer_type)
## # A tibble: 12 × 9
## started_y_month .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 2020-10 ride_… casual member 122030 215059 146. 1.47e5 0
## 2 2020-11 ride_… casual member 72685 149122 121. 8.41e4 0
## 3 2020-12 ride_… casual member 24417 89023 58.9 2.68e4 0
## 4 2021-01 ride_… casual member 14632 68804 40.4 1.60e4 0
## 5 2021-02 ride_… casual member 8546 34343 36.8 9.45e3 1.73e-277
## 6 2021-03 ride_… casual member 75307 130005 137. 8.87e4 0
## 7 2021-04 ride_… casual member 119880 177713 160. 1.48e5 0
## 8 2021-05 ride_… casual member 215773 234071 219. 2.75e5 0
## 9 2021-06 ride_… casual member 302704 304485 233. 3.97e5 0
## 10 2021-07 ride_… casual member 368172 322789 244. 4.99e5 0
## 11 2021-08 ride_… casual member 340498 332806 227. 4.63e5 0
## 12 2021-09 ride_… casual member 292100 328086 210. 3.85e5 0
The results of the t-test reveal a noteworthy disparity in means between members and casuals for each month. This observation suggests that, irrespective of the specific month, casual users generally exhibit longer ride duration compared to members. This indicates that casual riders tend to use bikes for longer periods than members across all of the year. Members exhibit greater ride duration stability throughout the year with minimal fluctuations. Casual users experience their shortest ride duration on January (20 minutes) and their longest on March (29.6 minutes).
H7: Members are more active on weekdays rather than weekends, casual riders are highly active on weekends.
bike_cleaned_2 %>%
group_by(day_of_week, customer_type) %>%
summarise(count = n()) %>%
ggplot(aes(x = day_of_week, y = count, fill = customer_type)) +
geom_col(width = 0.8) +
scale_y_continuous(name = "Total Rides", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(50000, 1000000)) +
labs(x = "Day of Week") +
theme_classic() +
ggtitle("Customer Total Rides by Day of Week") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size =20)) +
geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 3.5)
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
chisq.test(table(bike_cleaned_2$day_of_week, bike_cleaned_2$customer_type))
##
## Pearson's Chi-squared test
##
## data: table(bike_cleaned_2$day_of_week, bike_cleaned_2$customer_type)
## X-squared = 128377, df = 6, p-value < 2.2e-16
Chi-square test result show that p-value is less than 0.05, means that there is a significant difference between members and casuals rides by the day of week. There was evidence that at the 5% level, day of week is a factor which can impact total rides significantly.
The data suggests that casual users prefer to use bike on weekend, while members love to use bike during weekdays. On weekdays, the total rides of members are higher than casual users. And, on weekend, the total rides of casual users are higher than members. Casual riders demonstrated their peak usage Saturday, while members reached their highest point on Wednesday. It’s worth noting that member ridership consistently exceeded that of casual riders except Sunday and Saturday. The seventh hypothesis, H7: members are more active on weekdays rather than weekends, casual riders are highly active on weekends, finds support in the data. Consequently, launching marketing campaigns on weekend for casual users and during weekdays for members would be advantageous for Cyclistic Company.
H8: Members’ total ride duration is higher than casuals on weekdays while lower than casuals on weekends.
bike_cleaned_2 %>%
group_by(day_of_week, customer_type) %>%
summarise(total_duration_d = round(sum(ride_duration), 1)) %>%
ggplot(aes(x = day_of_week, y = total_duration_d, fill = customer_type)) +
geom_col(width = 0.95) +
scale_y_continuous(name = "Total ride duration", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(50000, 20000000)) +
scale_x_discrete(expand = c(0.1, 0.1)) +
labs(x = "Day of Week") +
theme_classic() +
ggtitle("Customers' Total Ride Duration per Day of Week") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.8, face = "bold", size =20), legend.position = "none") +
geom_text(aes(label = total_duration_d), position = position_stack(vjust = 0.5), vjust =1, size = 4)
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
By the chart above, it’s evident that casual users have higher total ride duration throughout all days of week. This pattern suggests that casual riders maintain longer ride duration compared to members during every day of week. The eighth hypothesis H8, which posited that members’ total ride duration is higher than casuals on weekdays while lower than casuals on weekends, did not find support in the data. This reinforces the need for Cyclistic to formulate strategies aimed at casual users as they spend more time on Cyclistic’s product.
H9: Members’ average ride duration is higher than casuals on weekdays while lower than casuals on weekends.
aggregate(bike_cleaned_2, ride_duration~ day_of_week + customer_type, FUN = mean)
## day_of_week customer_type ride_duration
## 1 Sun casual 31.00587
## 2 Mon casual 27.37242
## 3 Tue casual 24.63781
## 4 Wed casual 23.67308
## 5 Thu casual 23.06321
## 6 Fri casual 25.04076
## 7 Sat casual 29.29495
## 8 Sun member 15.29003
## 9 Mon member 12.94820
## 10 Tue member 12.70576
## 11 Wed member 12.75888
## 12 Thu member 12.64609
## 13 Fri member 13.07892
## 14 Sat member 14.98542
bike_cleaned_2 %>%
group_by(day_of_week, customer_type) %>%
summarise(mean_duration_d = round(mean(ride_duration), 1)) %>%
arrange(day_of_week) %>%
ggplot(aes(x = day_of_week, y = mean_duration_d, fill = customer_type)) +
geom_col(width = 0.6, position = "stack") +
scale_y_continuous(name = "Mean of Ride Duration") +
coord_cartesian(ylim = c(5, 50)) +
labs(x = "Day of Week") +
theme_classic() +
ggtitle("Customer's Ride Duration Mean per Day of Week") +
theme(plot.title = element_text(hjust = 0.3, vjust = 0.5, face = "bold", size = 20)) +
geom_text(aes(label = mean_duration_d), vjust = 2, size = 5)
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
bike_cleaned_2$combined_type <- interaction(bike_cleaned_2$bike_type, bike_cleaned_2$customer_type, drop = TRUE)
#leveneTest(ride_duration ~ combined_type, data = bike_cleaned_2)
The variance between customer type and bike type is not equal.
bike_cleaned_2 %>%
group_by(day_of_week) %>%
rstatix::t_test(ride_duration ~ customer_type)
## # A tibble: 7 × 9
## day_of_week .y. group1 group2 n1 n2 statistic df p
## * <fct> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Sun ride_duration casual member 380262 295546 264. 542431. 0
## 2 Mon ride_duration casual member 218286 323979 212. 263087. 0
## 3 Tue ride_duration casual member 202939 351276 182. 241738. 0
## 4 Wed ride_duration casual member 208005 365984 175. 249956. 0
## 5 Thu ride_duration casual member 221885 361973 174. 270498. 0
## 6 Fri ride_duration casual member 277730 345713 209. 358024. 0
## 7 Sat ride_duration casual member 447637 341835 272. 645028. 0
The results of the t-test reveal a significant difference in ride duration means between members and casuals for each day of the week. This finding indicates that, regardless of the specific day of the week, casual users generally have longer ride durations compared to members. This suggests that casual riders tend to use bikes for extended periods throughout the entire week, while members exhibit greater ride duration consistency during weekdays with minimal fluctuations. Casual users experience their shortest ride duration on Thursday (23.1 minutes) and their longest on Sunday (31 minutes).
Notably, both member and casual users demonstrate extended average ride duration over the weekend. Contrary to my hypothesis (H9) that suggested a higher average ride duration for members on weekdays and lower on weekends compared to casuals, the data does not support this hypothesis. This discrepancy is another indicator that Cyclistic company should pay more attention to casual users. They may need to recheck their financial data to determine whether member users are the most profitable segment.
H10: Members have a ride peak at 8 mornings and at 17 afternoons during weekdays while casual users tend to ride more in the afternoon during weekends.
bike_cleaned_2 %>%
group_by(started_hour, customer_type) %>%
summarise(count_h = n()) %>%
ggplot(aes(x = started_hour, y = count_h, color = customer_type, group = customer_type)) +
geom_line(linewidth = 1) +
geom_point(size =2) +
scale_y_continuous(name = "Total Rides", labels = function(x) format(x, big.mark = ",", scientific = F)) +
coord_cartesian(ylim = c(0, 300000)) +
labs(x = "Hour of Day") +
theme_classic() +
ggtitle("Customers Total Rides by Hour of Day") +
theme(plot.title = element_text(hjust = 0.5, vjust =0.5, face = "bold", size =20))
## `summarise()` has grouped output by 'started_hour'. You can override using the
## `.groups` argument.
Based on the chart provided, a clear pattern emerges: member users exhibit peak ride activity during both mornings at 8 AM and afternoons at 5 PM, while casual users predominantly show a peak ride time exclusively afternoons at 5 PM. This observation lends support to the tenth hypothesis, H10: Members have a ride peak at 8 mornings and at 17 afternoons during weekdays while casual users tend to ride more in the afternoon during weekends.
H11: Members’ total rides per hour of the day are higher than casuals.
chisq.test(table(bike_cleaned_2$customer_type, bike_cleaned_2$started_hour))
##
## Pearson's Chi-squared test
##
## data: table(bike_cleaned_2$customer_type, bike_cleaned_2$started_hour)
## X-squared = 149434, df = 23, p-value < 2.2e-16
The results of the chi-square test indicate a p-value of less than 0.05, indicating a statistically significant difference in the preferred ride hours between members and casual users. Specifically, member riders have two distinct preferred periods: the first period is from 6 AM to 9 AM, and the second period is from 3 PM to 7 PM. In contrast, casual users exhibit a single preferred period, which is from 3 PM to 7 PM. Additionally, during the time period from 9 PM to 3 AM, casual users have higher ride frequencies than members. As a result, the eleventh hypothesis, H11: Members’ overall rides per hour of the day are higher than those of casuals, is not supported.
H12:Members’ average ride duration per hour of the day is higher than casuals.
aggregate(bike_cleaned_2, ride_duration~customer_type + started_hour, FUN = mean)
## customer_type started_hour ride_duration
## 1 casual 00 23.52395
## 2 member 00 12.19040
## 3 casual 01 22.78001
## 4 member 01 12.57688
## 5 casual 02 22.22165
## 6 member 02 12.54819
## 7 casual 03 22.42704
## 8 member 03 12.85106
## 9 casual 04 20.44658
## 10 member 04 11.65212
## 11 casual 05 19.58184
## 12 member 05 10.94873
## 13 casual 06 17.70604
## 14 member 06 11.85674
## 15 casual 07 18.44755
## 16 member 07 12.08027
## 17 casual 08 21.41048
## 18 member 08 12.08692
## 19 casual 09 27.19392
## 20 member 09 12.65344
## 21 casual 10 30.69946
## 22 member 10 13.48381
## 23 casual 11 31.59941
## 24 member 11 13.74236
## 25 casual 12 30.88798
## 26 member 12 13.59812
## 27 casual 13 31.77114
## 28 member 13 13.95049
## 29 casual 14 31.46794
## 30 member 14 14.36625
## 31 casual 15 29.86641
## 32 member 15 14.11615
## 33 casual 16 27.53829
## 34 member 16 14.02153
## 35 casual 17 25.34928
## 36 member 17 14.03994
## 37 casual 18 24.42931
## 38 member 18 13.83907
## 39 casual 19 24.77987
## 40 member 19 13.60101
## 41 casual 20 25.03570
## 42 member 20 13.36551
## 43 casual 21 24.16309
## 44 member 21 12.99250
## 45 casual 22 23.46371
## 46 member 22 12.67809
## 47 casual 23 23.27230
## 48 member 23 12.34701
bike_cleaned_2 %>%
group_by(started_hour, customer_type) %>%
summarise(mean_duration_h = round(mean(ride_duration), 1)) %>%
ggplot(aes(x = started_hour, y = mean_duration_h, color = customer_type, group = customer_type)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(x = "Hour of Day", y = "Ride Duration Mean") +
theme_classic() +
ggtitle("Average Ride Duration by Hour of Day") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20))
## `summarise()` has grouped output by 'started_hour'. You can override using the
## `.groups` argument.
The chart provided reveals a distinct pattern: the average ride duration of members is lower than that of casual users across different hours of the day.
bike_cleaned_2 %>%
group_by(started_hour, customer_type) %>%
summarise(mean_duration_hh = round(mean(ride_duration), 1)) %>%
ggplot(aes(x = started_hour, y = mean_duration_hh, fill = customer_type)) +
geom_col(width = 0.9) +
scale_y_continuous(expand = c(0, 0)) +
labs(x = "Hour of Day", y = "Ride Duration Mean") +
theme_classic() +
ggtitle("Average Ride Duration by Hour of Day") +
theme(plot.title = element_text(hjust = 0.5, vjust = 0.5, face = "bold", size = 20), legend.position = "top", legend.justification = "left") +
geom_text(aes(label = mean_duration_hh), position = position_stack(vjust = 0.5), vjust = 0.5, size = 3.2)
## `summarise()` has grouped output by 'started_hour'. You can override using the
## `.groups` argument.
bike_cleaned_2 %>%
group_by(started_hour) %>%
rstatix::t_test(ride_duration ~ customer_type)
## # A tibble: 24 × 9
## started_hour .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 00 ride_dura… casual member 38885 21793 74.1 57095. 0
## 2 01 ride_dura… casual member 27676 13921 56.6 41339. 0
## 3 02 ride_dura… casual member 17307 7597 42.2 24749. 0
## 4 03 ride_dura… casual member 9092 4260 27.9 13057. 7.13e-166
## 5 04 ride_dura… casual member 6074 5472 23.9 8706. 9.99e-123
## 6 05 ride_dura… casual member 7888 22955 27.6 8601. 4.64e-161
## 7 06 ride_dura… casual member 18052 67255 31.2 19835. 1.28e-208
## 8 07 ride_dura… casual member 32818 120013 47.2 36012. 0
## 9 08 ride_dura… casual member 45557 133069 70.2 50303. 0
## 10 09 ride_dura… casual member 56625 100498 102. 65386. 0
## # ℹ 14 more rows
The results of the t-test highlight a significant disparity in ride duration means between members and casual users across every hour of the day. This finding underscores the fact that, irrespective of the specific time, casual users tend to have longer ride durations compared to members. This observation implies that casual riders frequently utilize bikes for extended periods throughout the entire day, while members exhibit a more consistent ride duration with minimal fluctuations. Notably, casual users experience their briefest ride duration at 6 am (17.7 minutes) and their lengthiest at 11 am (31.6 minutes).
In contrast to my hypothesis (H12), which posited that members’ average ride duration per hour of the day is higher than that of casuals, the data fails to support this conjecture. This discrepancy serves as a clear indicator that Cyclistic should allocate more attention to casual users. It may be necessary to review their financial data to ascertain whether member users indeed constitute the most profitable segment.
H13: Members’ starting and ending locations are clustered in the downtown area, while casual customers’ locations are more dispersed and along the beach.
start_station_map_member <- bike_cleaned_2 %>%
select(start_station_name,
start_lat,
start_lng,
customer_type) %>%
filter(customer_type =="member") %>%
group_by(start_station_name) %>%
mutate(num_rides = n()) %>%
distinct(start_station_name, .keep_all = TRUE)
arrange(start_station_map_member, desc(num_rides))
## # A tibble: 756 × 5
## # Groups: start_station_name [756]
## start_station_name start_lat start_lng customer_type num_rides
## <chr> <dbl> <dbl> <chr> <int>
## 1 Clark St & Elm St 41.9 -87.6 member 23161
## 2 Wells St & Concord Ln 41.9 -87.6 member 21260
## 3 Kingsbury St & Kinzie St 41.9 -87.6 member 20289
## 4 Wells St & Elm St 41.9 -87.6 member 19130
## 5 Dearborn St & Erie St 41.9 -87.6 member 17801
## 6 St. Clair St & Erie St 41.9 -87.6 member 17517
## 7 Wells St & Huron St 41.9 -87.6 member 17336
## 8 Broadway & Barry Ave 41.9 -87.6 member 17128
## 9 Clark St & Armitage Ave 41.9 -87.6 member 15934
## 10 Theater on the Lake 41.9 -87.6 member 15933
## # ℹ 746 more rows
sum(start_station_map_member$num_rides)
## [1] 2386306
start_station_map_casual <- bike_cleaned_2 %>%
select( start_station_name,
start_lat,
start_lng,
customer_type) %>%
filter(customer_type =="casual") %>%
group_by(start_station_name) %>%
mutate(num_rides = n()) %>%
distinct(start_station_name, .keep_all = TRUE)
arrange(start_station_map_casual, desc(num_rides))
## # A tibble: 776 × 5
## # Groups: start_station_name [776]
## start_station_name start_lat start_lng customer_type num_rides
## <chr> <dbl> <dbl> <chr> <int>
## 1 Streeter Dr & Grand Ave 41.9 -87.6 casual 60125
## 2 Millennium Park 41.9 -87.6 casual 30646
## 3 Michigan Ave & Oak St 41.9 -87.6 casual 28114
## 4 Lake Shore Dr & Monroe St 41.9 -87.6 casual 22743
## 5 Shedd Aquarium 41.9 -87.6 casual 21231
## 6 Theater on the Lake 41.9 -87.6 casual 20955
## 7 Wells St & Concord Ln 41.9 -87.6 casual 17746
## 8 Lake Shore Dr & North Blvd 41.9 -87.6 casual 16005
## 9 Indiana Ave & Roosevelt Rd 41.9 -87.6 casual 15954
## 10 Clark St & Lincoln Ave 41.9 -87.6 casual 15774
## # ℹ 766 more rows
sum(start_station_map_casual$num_rides)
## [1] 1956744
start_station_map <- start_station_map_casual %>%
union(start_station_map_member)
tail(start_station_map)
## # A tibble: 6 × 5
## # Groups: start_station_name [6]
## start_station_name start_lat start_lng customer_type num_rides
## <chr> <dbl> <dbl> <chr> <int>
## 1 Narragansett & McLean 41.9 -87.8 member 1
## 2 WEST CHI-WATSON 41.9 -87.7 member 1
## 3 Kildare Ave & 26th St 41.8 -87.7 member 1
## 4 Panama Ave & Grace St 42.0 -87.8 member 1
## 5 Doty Ave & 111th St 41.7 -87.6 member 1
## 6 Central Park Ave & Douglas Blvd 41.9 -87.7 member 2
sum(subset(start_station_map, customer_type =="member")$num_rides)
## [1] 2386306
sum(subset(start_station_map, customer_type =="casual")$num_rides)
## [1] 1956744
start_bins <- seq(0, 70000, by = 1000)
density_color <- colorBin(palette = "magma", domain = start_station_map$num_rides, na.color = "transparent", bins = start_bins, reverse = T)
start_map_text <- paste("station name: ", start_station_map$start_station_name, "<br/>", "number of rides: ", start_station_map$num_rides, "<br/>", "customer type: ", start_station_map$customer_type, sep = "") %>%
lapply(htmltools::HTML)
tag.map.title <- htmltools::tags$style(htmltools::HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 25%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 18px;
color: blue
}
"))
title <- htmltools::tags$div(
tag.map.title, htmltools::HTML("Popular Start Stations")
)
rides_per_station <- leaflet(start_station_map) %>%
leaflet::addTiles() %>%
leaflet::setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% # Chicago coordinates
leaflet::addProviderTiles("Esri.WorldGrayCanvas") %>% # set map style
leaflet::addCircleMarkers(~ start_lng, ~ start_lat, # add circle markers for each station, set fill color, add tooltip.
fillColor = ~ density_color(num_rides),
fillOpacity = 1,
color = "white",
# radius = ~ start_station_map$num_rides/1500, #this is an option for variable marker size based on number of rides.
radius = start_station_map$num_rides/3000,
stroke = FALSE,
label = start_map_text,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto")) %>%
#add legend.
leaflet::addLegend(
pal = density_color,
values = ~ num_rides,
opacity = 0.8,
title = "Number of Rides",
position = "bottomright") %>%
leaflet::addControl(title, position = "topleft", className="map-title")
rides_per_station
The station that stands out as the most popular is Streeter Dr & Grand Ave. This particular station boasts proximity to the lakeside train connector, along with Jane Addams Memorial Park and Ohio Beach Street. It is highly likely to be frequented by tourists and weekend visitors. Among the top ten stations, six are conveniently located next to public parks, three are situated in front of sizable residential buildings, and two are in close proximity to fitness centers.
density_color_member <- colorBin(palette = "magma", domain = start_station_map_member$num_rides, na.color = "transparent", bins = start_bins, reverse = T)
density_color_casual <- colorBin(palette = "magma", domain = start_station_map_casual$num_rides, na.color = "transparent", bins = start_bins, reverse = T)
start_map_text_member <- paste("station name: ", start_station_map_member$start_station_name, "<br/>", "number of rides: ", start_station_map_member$num_rides, sep = "") %>%
lapply(htmltools::HTML)
start_map_text_casual <- paste("station name: ", start_station_map_casual$start_station_name, "<br/>", "number of rides: ", start_station_map_casual$num_rides, sep = "") %>%
lapply(htmltools::HTML)
tag.map.title <- htmltools::tags$style(htmltools::HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 25%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 18px;
color: blue
}
"))
member_map_title <- htmltools::tags$div(
tag.map.title, htmltools::HTML("Popular Stations for Member User")
)
rides_per_station_member <- leaflet(start_station_map_member) %>%
leaflet::addTiles() %>%
leaflet::setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% # Chicago coordinates
leaflet::addProviderTiles("Esri.WorldGrayCanvas") %>% # set map style
leaflet::addCircleMarkers(~ start_lng, ~ start_lat, # add circle markers for each station, set fill color, add tooltip.
fillColor = ~ density_color_member(num_rides),
fillOpacity = 1,
color = "white",
# radius = ~ start_station_map$num_rides/1500, #this is an option for variable marker size based on number of rides.
radius = start_station_map_member$num_rides/3000,
stroke = FALSE,
label = start_map_text_member,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto"))
#add legend.
#addLegend(
#pal = density_color_member,
# values = ~ num_rides,
#opacity = 0.8,
#title = "Number of Rides",
#position = "bottomright")
rides_per_station_member
tag.casual.title <- htmltools::tags$style(htmltools::HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 65%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 18px;
color: blue
}
"))
casual_map_title <- htmltools::tags$div(
tag.casual.title, htmltools::HTML("Casual Users")
)
rides_per_station_casual <- leaflet(start_station_map_casual) %>%
leaflet::addTiles() %>%
leaflet::setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% # Chicago coordinates
leaflet::addProviderTiles("Esri.WorldGrayCanvas") %>% # set map style
leaflet::addCircleMarkers(~ start_lng, ~ start_lat, # add circle markers for each station, set fill color, add tooltip.
fillColor = ~ density_color_casual(num_rides),
fillOpacity = 1,
color = "white",
# radius = ~ start_station_map$num_rides/1500, #this is an option for variable marker size based on number of rides.
radius = start_station_map_casual$num_rides/3000,
stroke = FALSE,
label = start_map_text_casual,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto")) %>%
#add legend.
leaflet::addLegend(
pal = density_color_casual,
values = ~ num_rides,
opacity = 0.8,
title = "Number of Rides",
position = "bottomright") %>%
leaflet::addControl(casual_map_title, position = "topleft", className="map-title")
rides_per_station_casual
sync_maps <- sync(rides_per_station_member, rides_per_station_casual)
sync_maps
Through the maps above, we can clearly see that for casual riders the areas of their interest are located around the city center and along the beach where the most cultural & leisure points are.
In contrast, members’ bike usage during the week is less dense in tourist areas but instead, it’s quite heavy around Chicago’s downtown area which could also prove our hypothesis that most of the annual members commute daily to work.
For the starting stations, Casual riders often started from aquariums, the vicinity of museums, parks, beaches, and harbor points. In contrast, members often started from stations close to universities, residential areas, restaurants, hospitals, grocery stores, etc.
end_station_map_member <- bike_cleaned_2 %>%
select(end_station_name,
end_lat,
end_lng,
customer_type) %>%
filter(customer_type =="member") %>%
group_by(end_station_name) %>%
mutate(num_rides = n()) %>%
distinct(end_station_name, .keep_all = TRUE)
arrange(end_station_map_member, desc(num_rides))
## # A tibble: 754 × 5
## # Groups: end_station_name [754]
## end_station_name end_lat end_lng customer_type num_rides
## <chr> <dbl> <dbl> <chr> <int>
## 1 Clark St & Elm St 41.9 -87.6 member 23484
## 2 Wells St & Concord Ln 41.9 -87.6 member 21886
## 3 Kingsbury St & Kinzie St 41.9 -87.6 member 20723
## 4 Wells St & Elm St 41.9 -87.6 member 19564
## 5 Dearborn St & Erie St 41.9 -87.6 member 18425
## 6 St. Clair St & Erie St 41.9 -87.6 member 17724
## 7 Broadway & Barry Ave 41.9 -87.6 member 17436
## 8 Wells St & Huron St 41.9 -87.6 member 16724
## 9 Clark St & Armitage Ave 41.9 -87.6 member 15304
## 10 Clark St & Lincoln Ave 41.9 -87.6 member 15017
## # ℹ 744 more rows
sum(end_station_map_member$num_rides)
## [1] 2386306
end_station_map_casual <- bike_cleaned_2 %>%
select(end_station_name,
end_lat,
end_lng,
customer_type) %>%
filter(customer_type =="casual") %>%
group_by(end_station_name) %>%
mutate(num_rides = n()) %>%
distinct(end_station_name, .keep_all = TRUE)
arrange(end_station_map_casual, desc(num_rides))
## # A tibble: 774 × 5
## # Groups: end_station_name [774]
## end_station_name end_lat end_lng customer_type num_rides
## <chr> <dbl> <dbl> <chr> <int>
## 1 Streeter Dr & Grand Ave 41.9 -87.6 casual 62556
## 2 Millennium Park 41.9 -87.6 casual 32285
## 3 Michigan Ave & Oak St 41.9 -87.6 casual 29689
## 4 Theater on the Lake 41.9 -87.6 casual 22753
## 5 Lake Shore Dr & Monroe St 41.9 -87.6 casual 21606
## 6 Shedd Aquarium 41.9 -87.6 casual 19624
## 7 Lake Shore Dr & North Blvd 41.9 -87.6 casual 19032
## 8 Wells St & Concord Ln 41.9 -87.6 casual 17656
## 9 Clark St & Lincoln Ave 41.9 -87.6 casual 16284
## 10 Wabash Ave & Grand Ave 41.9 -87.6 casual 16068
## # ℹ 764 more rows
sum(end_station_map_casual$num_rides)
## [1] 1956744
end_station_map <- end_station_map_casual %>%
union(end_station_map_member)
arrange(end_station_map, desc(num_rides))
## # A tibble: 1,528 × 5
## # Groups: end_station_name [776]
## end_station_name end_lat end_lng customer_type num_rides
## <chr> <dbl> <dbl> <chr> <int>
## 1 Streeter Dr & Grand Ave 41.9 -87.6 casual 62556
## 2 Millennium Park 41.9 -87.6 casual 32285
## 3 Michigan Ave & Oak St 41.9 -87.6 casual 29689
## 4 Clark St & Elm St 41.9 -87.6 member 23484
## 5 Theater on the Lake 41.9 -87.6 casual 22753
## 6 Wells St & Concord Ln 41.9 -87.6 member 21886
## 7 Lake Shore Dr & Monroe St 41.9 -87.6 casual 21606
## 8 Kingsbury St & Kinzie St 41.9 -87.6 member 20723
## 9 Shedd Aquarium 41.9 -87.6 casual 19624
## 10 Wells St & Elm St 41.9 -87.6 member 19564
## # ℹ 1,518 more rows
sum(subset(end_station_map, customer_type =="member")$num_rides)
## [1] 2386306
sum(subset(end_station_map, customer_type =="casual")$num_rides)
## [1] 1956744
Among the top ten stations, six are conveniently located next to public parks, three are situated in front of sizable residential buildings, and two are in close proximity to fitness centers.
end_bins <- seq(0, 63000, by = 1000)
density_color_end <- colorBin(palette = "magma", domain = end_station_map$num_rides, na.color = "transparent", bins = end_bins, reverse = T)
end_map_text <- paste("station name: ", end_station_map$end_station_name, "<br/>", "number of rides: ", end_station_map$num_rides, "<br/>", "customer type: ", end_station_map$customer_type, sep = "") %>%
lapply(htmltools::HTML)
tag.map.title_end <- htmltools::tags$style(htmltools::HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 25%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 18px;
color: blue
}
"))
title_end <- htmltools::tags$div(
tag.map.title, htmltools::HTML("Popular End Stations")
)
rides_end_station <- leaflet(end_station_map) %>%
leaflet::addTiles() %>%
leaflet::setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% # Chicago coordinates
leaflet::addProviderTiles("Esri.WorldGrayCanvas") %>% # set map style
leaflet::addCircleMarkers(~ end_lng, ~ end_lat, # add circle markers for each station, set fill color, add tooltip.
fillColor = ~ density_color_end(num_rides),
fillOpacity = 1,
color = "white",
# radius = ~ start_station_map$num_rides/1500, #this is an option for variable marker size based on number of rides.
radius = end_station_map$num_rides/3000,
stroke = FALSE,
label = end_map_text,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto")) %>%
#add legend.
leaflet::addLegend(
pal = density_color_end,
values = ~ num_rides,
opacity = 0.8,
title = "Number of Rides",
position = "bottomright") %>%
leaflet::addControl(title_end, position = "topleft", className="map-title")
rides_end_station
The end station that stands out as the most popular is also Streeter Dr & Grand Ave.
density_color_end_member <- colorBin(palette = "magma", domain = end_station_map_member$num_rides, na.color = "transparent", bins = end_bins, reverse = T)
density_color_end_casual <- colorBin(palette = "magma", domain = end_station_map_casual$num_rides, na.color = "transparent", bins = end_bins, reverse = T)
end_map_text_member <- paste("station name: ", end_station_map_member$end_station_name, "<br/>", "number of rides: ", end_station_map_member$num_rides, sep = "") %>%
lapply(htmltools::HTML)
end_map_text_casual <- paste("station name: ", end_station_map_casual$end_station_name, "<br/>", "number of rides: ", end_station_map_casual$num_rides, sep = "") %>%
lapply(htmltools::HTML)
tag.member_end.title <- htmltools::tags$style(htmltools::HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 25%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 18px;
color: blue
}
"))
member_end_map_title <- htmltools::tags$div(
tag.member_end.title, htmltools::HTML("Popular End Stations (member)")
)
rides_end_station_member <- leaflet(end_station_map_member) %>%
leaflet::addTiles() %>%
leaflet::setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% # Chicago coordinates
leaflet::addProviderTiles("Esri.WorldGrayCanvas") %>% # set map style
leaflet::addCircleMarkers(~ end_lng, ~ end_lat, # add circle markers for each station, set fill color, add tooltip.
fillColor = ~ density_color_end_member(num_rides),
fillOpacity = 1,
color = "white",
# radius = ~ start_station_map$num_rides/1500, #this is an option for variable marker size based on number of rides.
radius = end_station_map_member$num_rides/3000,
stroke = FALSE,
label = end_map_text_member,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto"))
#addControl(member_end_map_title, position = "topleft", className="map-title")
#dd legend.
#addLegend(
# pal = density_color_end_member,
#values = ~ num_rides,
# opacity = 0.8,
#title = "Number of Rides",
# position = "bottomright")
rides_end_station_member
tag.casual_end.title <- htmltools::tags$style(htmltools::HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 60%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 18px;
color: blue
}
"))
casual_end_map_title <- htmltools::tags$div(
tag.casual_end.title, htmltools::HTML("Casual End Stations")
)
rides_end_station_casual <- leaflet(end_station_map_casual) %>%
leaflet::addTiles() %>%
leaflet::setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% # Chicago coordinates
leaflet::addProviderTiles("Esri.WorldGrayCanvas") %>% # set map style
leaflet::addCircleMarkers(~ end_lng, ~ end_lat, # add circle markers for each station, set fill color, add tooltip.
fillColor = ~ density_color_end_casual(num_rides),
fillOpacity = 1,
color = "white",
# radius = ~ start_station_map$num_rides/1500, #this is an option for variable marker size based on number of rides.
radius = end_station_map_casual$num_rides/3000,
stroke = FALSE,
label = end_map_text_casual,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto")) %>%
#add legend.
leaflet::addLegend(
pal = density_color_end_casual,
values = ~ num_rides,
opacity = 0.8,
title = "Number of Rides",
position = "bottomright") %>%
leaflet::addControl(casual_end_map_title, position = "topleft", className="map-title")
rides_end_station_casual
sync_maps_end <- sync(rides_end_station_member, rides_end_station_casual)
sync_maps_end
For the ended stations, we have similar findings. Casual riders tend to end their trip near museums and other attraction sites while members often end their journey close to universities, and residential and commercial areas.
The thirteen hypothesis, H13: members’ starting and ending locations are clustered in the downtown area, while casual customers’ locations are more dispersed and along the beach, was supported.
Through the data analysis, we can confidently say that members and casuals are two different customer groups.
Firstly, it’s important to recognize that members and casual customers utilize Cyclistic bikes for distinct purposes. Members use the bikes for their daily commutes, whereas casual customers use them primarily for sightseeing around Chicago. Consequently, members exhibit higher bike usage frequency, with ride durations roughly half that of casual riders.
Furthermore, there are notable differences in the start and end locations of their journeys. Members typically initiate and conclude their trips in proximity to universities, residential areas, and commercial districts. On the other hand, casual customers tend to start and finish their rides near parks, museums, and coastal areas.
These variations in bike usage patterns, trip durations, and preferred locations should be taken into account while devising strategies to target and engage both member and casual customer segments effectively. It seems that converting casual customers to members will prove to be the most challenging task as both groups have totally different preferences. Although Cyclistic can attempt to develop a campaign to convert casual customers into members, the likelihood of achieving success seems rather low.
Based on the analysis, it is evident that casual riders predominantly use bike-sharing for leisure and tourism, with heightened activity during weekends. On the other hand, members primarily use bike-sharing for their daily work commute, showing increased activity on weekdays. Considering these findings, Cyclistic should explore offering new membership options tailored to weekend riders, family memberships (as families often spend weekends together), or partnerships with museums, theaters, and other popular locations frequently visited by casual riders.
The data indicates that casual riders take significantly longer trips compared to members. To capitalize on this insight, Cyclistic can consider introducing bonuses or rewards for riders who take longer trips, encouraging extended usage.
Bike usage peaks during June to August, coinciding with the summer season. With this in mind, Cyclistic should initiate marketing campaigns in spring, offering early-bird discounts for the new membership types. These campaigns should continue during the peak summer months to attract and retain customers.
“Despite the challenges, all hope is not lost for Cyclistic, as the analysis results provide a foundation for several revenue-boosting actions. To address the initial questions more succinctly, we can rephrase them as, ‘How can Cyclistic increase its revenue based on the available data?’ Additionally, let’s not forget the last original question: ‘How can the marketing team leverage digital media to maximize the number of members?’
Create targeted social media campaigns aimed at casual customers on weekends, for example, offering trial benefits to casual users with an average monthly ride duration of 25 minutes. Target users near the top 25 stations and run the campaign from mid-spring to mid-autumn, with special promotions for weekend trips and classic bike riders. Regarding digital media utilization, Cyclistic can craft more focused and effective campaigns, for instance, a ‘Weekend Experience with Cyclistic’ social media campaign could incentivize them to share their positive experiences for future discounts or lower membership fees. Encouraging resharing would expand reach and attract potential customers within a limited timeframe.
Furthermore, expansion opportunities could be explored by analyzing customers’ home addresses, identifying tourist hotspots. By establishing branches in these areas, Cyclistic can attract both local residents and tourists as members, enhancing daily commute and sightseeing experiences.
A more flexible pricing structure tailored to weekend users could also entice more customers to become members. A membership plan accommodating weekend-only bike usage would appeal to those seeking occasional rides.
Cyclistic could offer promotions to casual users who use Cyclistic for work commutes or create a corporate package for companies near the top 25 stations, providing Cyclistic rides as an employee benefit. The campaign should run from mid-spring to mid-autumn, targeting users near the top 25 stations, with special promotions for weekdays and electric and classic bike riders.
Recognize that members often take round trips, and many casual users may exhibit similar behavior but have not yet become members. Develop a campaign for users who use the service twice a day, offering a special promotion for becoming members. Target users near the top 25 stations and run the campaign from mid-spring to mid-autumn, focusing on weekdays and classic bike riders.
By implementing these strategies, Cyclistic can leverage the data insights to attract more members, enhance customer engagement, and ultimately increase revenue. Additionally, providing stakeholders with additional information such as user IDs, age, gender, and price could further refine the analysis and lead to even more targeted marketing efforts.
In conclusion, the Cyclistic case study demonstrates that casual customers and members constitute distinct customer groups, making conversion challenging. However, leveraging the available data, Cyclistic can implement targeted actions to maximize revenue and cater to both segments effectively.
Working on this capstone project has been an enriching experience, providing valuable insights into real-world data analysis, including data preparation and cleaning. Ensuring data reliability and integrity emerged as vital aspects before conducting the analysis. Additionally, visualizing diverse data metrics led to fascinating discoveries, informing sound business decisions.
Ultimately, the fulfillment lies in using data to guide decisions that enhance people’s lives. I look forward to future opportunities to delve into further analytical pursuits. Until then, happy analyzing!”